Background

The file ‘WeatherDownloads_202005_v002.Rmd’ contains code for dowloading and processing historical weather data as contained in METAR archives hosted by Iowa State University.

Data have been dowloaded and processed for several stations (airports) and years, with .rds files saved in “./RInputFiles/ProcessedMETAR”.

This module will perform exploratory data analysis on the processed weather files.

Data Availability

Each processed data file contains one year of hourly weather data for one station. Files are saved as ‘./RInputFiles/ProcessedMETAR/metar_kxxx_yyyy.rds’ where xxx is the three-digit airport code and yyyy is the four-digit year.

Each file contains the following variables:

  • METAR (chr) - the extracted portion of the METAR based on a regex string
  • WindDir (chr) - the previaling wind direction in degrees, stored as a character since ‘VRB’ means variable
  • WindSpeed (int) - the prevailing wind speed in knots
  • WindGust (dbl) - the wind gust speed in knots (NA if there is no recorded wind gust at that hour)
  • Dummy (chr) - artifact, always a blank space
  • Visibility (dbl) - surface visibility in statute miles
  • TempC (int) - temperature in degrees Celsius
  • DewC (int) - dew point in degrees Celsius
  • Altimeter (int) - altimeter in inches of mercury
  • SLP (int) - the raw sea-level-pressure reading from the METAR
  • FahrC (chr) - the raw temperature string pulled from the METAR (Tttttdddd) where tttt is the Fahrenheit temperature recorded in Celsius and dddd is the Fahrenheit dew point recorded in Celsius
  • dtime (dttm) - the date-time associated with the observation
  • origMETAR (chr) - the full METAR associated with the observation
  • TempF (dbl) - the Fahrenheit temperature associated with converting FahrC to Fahrenheit
  • DewF (dbl) - the Fahrenheit dew point associated with converting FahrC to Fahrenheit
  • modSLP (dbl) - Sea-Level Pressure (SLP), adjusted to reflect that SLP is recorded as 0-1000 but reflects data that are 950-1050
  • nSKC (int) - number of times ‘SKC’ (human-confirmed cloud-free) is recorded in the observation (should be 0 or 1)
  • nCLR (int) - number of times ‘CLR’ (austomated-sensor cloud-free) is recorded in the observation (should be 0 or 1, and should never have both nSKC>0 and nCLR>0)
  • cloudn (chr) - the nth cloud layer recorded in the METAR (layers begin with FEW, SCT, BKN, OVC or VV)
  • cTypen (chr) - the cloud type of the nth cloud layer (FEW, BKN, SCT, OVC, or VV)
  • cLeveln (dbl) - the cloud height in feet of the nth cloud layer
  • wType (fct) - highest level of obscuration recorded in the METAR (VV > OVC > BKN > SCT > FEW > CLR/SKC)
  • year (dbl) - year of the observation
  • monthint (dbl) - month of the observation as a number (e.g., 6=June)
  • month (fct) - month of the observation as three-character abbreviation, saved as a factor (e.g., Jun=June)
  • day (int) - day of the month of the observation

Base Functions Available

There are several functions available for analysis:

  • plotCountsByMetric() - bar plots for counts by variable

  • plotNumCor() - plot two numeric variables against each other

  • plotFactorNumeric() - boxplot a numeric variable against a factor variable

  • corMETAR() - correlations between METAR variables

  • lmMETAR() - linear regression modeling for METAR variables

  • basicWindPlots() - plot wind speed and direction

  • getWindDirGroup() - convert wind direction to a grouping (e.g., N for 320-360-40)

  • consolidatePlotWind() - show frequency plots of wind direction, city, and month

The tidyverse library is loaded and the 2016 Detroit data is read in to show examples of the functions:

library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
kdtw_2016 <- readRDS("./RInputFiles/ProcessedMETAR/metar_kdtw_2016.rds")

A variable mapping is created to allow for better readable variable names:

varMapper <- c(WindDir="Wind Direction (degrees)", 
               predomDir="General Prevailing Wind Direction",
               WindSpeed="Wind Speed (kts)",
               WindSpeed5="Wind Speed (kts), rounded to nearest 5 knots", 
               Visibility="Visibility (SM)", 
               TempC="Temperature (C)", 
               DewC="Dew Point (C)", 
               Altimeter="Altimeter (inches Hg)",
               Altimeter10="Altimeter (inches Hg), rounded to nearest 0.1 inHg", 
               modSLP="Sea-Level Pressure (hPa)", 
               TempF="Temperature (F)",
               DewF="Dew Point (F)", 
               TempF5="Temperature (F), rounded to nearest 5 degrees",
               DewF5="Dew Point (F), rounded to nearest 5 degrees", 
               cType1="First Cloud Layer Type", 
               cLevel1="First Cloud Layer Height (ft)",
               month="Month", 
               year="Year",
               wType="Greatest Sky Obscuration", 
               day="Day of Month"
               )

The function plotCountsByMetric() produces bar plots for counts by variable:

# Helper function for generating plots by key variables
plotcountsByMetric <- function(df, 
                               mets, 
                               title="", 
                               rotateOn=20, 
                               dropNA=TRUE, 
                               diagnose=FALSE,
                               mapper=varMapper,
                               facetOn=NULL, 
                               showCentral=FALSE
                               ) {
    
    # Function arguments
    # df: dataframe or tibble containing raw data
    # mets: character vector of variables for plotting counts
    # title: character vector for plot title
    # rotateOn: integer, x-axis labels will be rotated by 90 degrees if # categories >= rotateOn
    # dropNA: boolean for whether to drop all NA prior to plotting (recommended for avoiding warnings)
    # diagnose: boolean for whether to note in the log the number of NA observations dropped
    # mapper: named list containing mapping from variable name to well-formatted name for titles and axes
    # facetOn: a facetting variable for the supplied df (NULL for no faceting)
    # showCentral: boolean for whether to show the central tendency over-plotted on the main data
    
    # Function usage
    # 1.  By default, the function plots overall counts by metric for a given input
    # 2.  If facetOn is passed as a non-NULL, then the data in #1 will be facetted by facetOn
    # 3.  If showCentral=TRUE, then the overall mean will be plotted as a point on the main plot (only makes sense if facetOn has been selected)
    
    
    # Plot of counts by key metric
    for (x in mets) {
        # If a facetting variable is provided, need to include this in the group_by
        useVars <- x
        if (!is.null(facetOn)) { useVars <- c(facetOn, useVars) }
        dat <- df %>%
            group_by_at(vars(all_of(useVars))) %>%
            summarize(n=n())
        
        if (dropNA) {
            nOrig <- nrow(dat)
            sumOrig <- sum(dat$n)
            dat <- dat %>%
                filter_all(all_vars(!is.na(.)))
            if (diagnose & (nOrig > nrow(dat))) { 
                cat("\nDropping", 
                    nOrig-nrow(dat), 
                    "rows with", 
                    sumOrig-sum(dat$n), 
                    "observations due to NA\n"
                    )
            }
        }
        
        # Create the main plot
        p <- dat %>%
            ggplot(aes_string(x=x, y="n")) + 
            geom_col() + 
            labs(title=title,
                 subtitle=paste0("Counts By: ", mapper[x]), 
                 x=paste0(x, " - ", mapper[x]),
                 y="Count"
                 )
        # If the rotateOn criteria is exceeded, rotate the x-axis by 90 degrees
        if (nrow(dat) >= rotateOn) {
            p <- p + theme(axis.text.x=element_text(angle=90))
        }
        # If facetting has been requested, facet by the desired variable
        if (!is.null(facetOn)) {
            p <- p + facet_wrap(as.formula(paste("~", facetOn)))
        }
        # If showCentral=TRUE, add a dot plot for the overall average
        if (showCentral) {
            # Get the median number of observations by facet, or the total if facetOn=NULL
            if (is.null(facetOn)) {
                useN <- sum(dat$n)
            } else {
                useN <- dat %>%
                    group_by_at(vars(all_of(facetOn))) %>%
                    summarize(n=sum(n)) %>%
                    pull(n) %>%
                    median()
            }
            # Get the overall percentages by x
            centralData <- helperCountsByMetric(tbl=dat, ctVar=x, sumOn="n") %>%
                mutate(centralValue=nPct*useN)
            # Apply the median
            p <- p + geom_point(data=centralData, aes(y=centralValue), color="red", size=2)
        }
        # Print the plot
        print(p)
    }
}

# Example for Detroit 2016 - using WindDir, cType1, month, wType
plotcountsByMetric(kdtw_2016, 
                   mets=c("WindDir", "cType1", "month", "wType"), 
                   title="Detroit, MI (2016)"
                   )

The function plotNumCor() plots two numeric variables against one another:

# Create a function for plotting two variables against each other
plotNumCor <- function(met, 
                       var1, 
                       var2, 
                       title=NULL, 
                       subT="", 
                       dropNA=TRUE, 
                       diagnose=FALSE,
                       mapper=varMapper, 
                       facetOn=NULL, 
                       showCentral=FALSE
                       ) {
    
    # Function arguments
    # met: dataframe or tibble containing raw data
    # var1: character vector of variable to be used for the x-axis
    # var2: character vector of variable to be used for the y-axis
    # title: character vector for plot title
    # subT: character vector for plot subtitle
    # dropNA: boolean for whether to drop all NA prior to plotting (recommended for avoiding warnings)
    # diagnose: boolean for whether to note in the log the number of NA observations dropped
    # mapper: named list containing mapping from variable name to well-formatted name for titles and axes
    # facetOn: a facetting variable for the supplied met (NULL for no faceting)
    # showCentral: boolean for whether to show the central tendency over-plotted on the main data
    
    # Function usage
    # 1.  By default, the function plots overall counts by the provided x/y metrics, with each point sized based on the number of observations, and with an lm smooth overlaid
    # 2.  If facetOn is passed as a non-NULL, then the data in #1 will be facetted by facetOn
    # 3.  If showCentral=TRUE, then the lm smooth that best first to the overall data will be plotted (only makes sense if facetOn has been selected)
    
    # Create the title if not passed
    if (is.null(title)) { 
        title <- paste0("Hourly Observations of ", mapper[var1], " and ", mapper[var2]) 
    }

    # If a facetting variable is provided, need to include this in the group_by
    useVars <- c(var1, var2)
    if (!is.null(facetOn)) { useVars <- c(facetOn, useVars) }
        
    # Pull the counts by useVars
    dat <- met %>%
        group_by_at(vars(all_of(useVars))) %>%
        summarize(n=n()) 
    
    # If NA requested to be excluded, remove anything with NA
    if (dropNA) {
        nOrig <- nrow(dat)
        sumOrig <- sum(dat$n)
        dat <- dat %>%
            filter_all(all_vars(!is.na(.)))
        if (diagnose) { 
            cat("\nDropping", 
                nOrig-nrow(dat), 
                "rows with", 
                sumOrig-sum(dat$n), 
                "observations due to NA\n"
                )
        }
    }
    
    p <- dat %>%
        ggplot(aes_string(x=var1, y=var2)) + 
        geom_point(alpha=0.5, aes_string(size="n")) + 
        geom_smooth(method="lm", aes_string(weight="n")) + 
        labs(x=paste0(mapper[var1], " - ", var1), 
             y=paste0(mapper[var2], " - ", var2), 
             title=title, 
             subtitle=subT
             )
    
    # If facetting has been requested, facet by the desired variable
    if (!is.null(facetOn)) {
        p <- p + facet_wrap(as.formula(paste("~", facetOn)))
    }
    # If showCentral=TRUE, add a dashed line for the overall data
    if (showCentral) {
        p <- p + helperNumCor(dat, xVar=var1, yVar=var2, sumOn="n")
    }
    
    print(p)
}

# Example for Detroit 2016 - using TempC and TempF
plotNumCor(kdtw_2016, var1="TempC", var2="TempF", subT="Detroit, MI (2016)", diagnose=TRUE)
## 
## Dropping 1 rows with 49 observations due to NA

# Example for Detroit 2016 - using TempC and DewC
plotNumCor(kdtw_2016, var1="TempC", var2="DewC", subT="Detroit, MI (2016)", diagnose=TRUE)
## 
## Dropping 1 rows with 49 observations due to NA

# Example for Detroit 2016 - using Altimeter and modSLP
plotNumCor(kdtw_2016, var1="Altimeter", var2="modSLP", subT="Detroit, MI (2016)", diagnose=TRUE)
## 
## Dropping 1 rows with 49 observations due to NA

The function plotFactorNumeric() creates box plots for a numeric variable against a factor variable:

# Updated function for plotting numeric by factor
plotFactorNumeric <- function(met, 
                              fctVar, 
                              numVar, 
                              title=NULL, 
                              subT="", 
                              diagnose=TRUE,
                              showXLabel=TRUE,
                              mapper=varMapper,
                              facetOn=NULL, 
                              showCentral=FALSE
                              ) {
    
    # Function arguments
    # met: dataframe or tibble containing raw data
    # fctVar: character vector of variable to be used for the x-axis (factor in the boxplot)
    # numVar: character vector of variable to be used for the y-axis (numeric in the boxplot)
    # title: character vector for plot title
    # subT: character vector for plot subtitle
    # diagnose: boolean for whether to note in the log the number of NA observations dropped
    # showXLabel: boolean for whether to include the x-label (e.g., set to FALSE if using 'month')
    # mapper: named list containing mapping from variable name to well-formatted name for titles and axes
    # facetOn: a facetting variable for the supplied met (NULL for no faceting)
    # showCentral: boolean for whether to show the central tendency over-plotted on the main data
    
    # Function usage
    # 1.  By default, the function creates the boxplot of numVar by fctVar
    # 2.  If facetOn is passed as a non-NULL, then the data in #1 will be facetted by facetOn
    # 3.  If showCentral=TRUE, then the overall median of numVar by fctVar will be plotted as a red dot

    
    # Create the title if not passed
    if (is.null(title)) { 
        title <- paste0("Hourly Observations of ", mapper[numVar], " by ", mapper[fctVar])
    }
    
    # Remove the NA variables
    nOrig <- nrow(met)
    dat <- met %>%
        filter(!is.na(get(fctVar)), !is.na(get(numVar)))
    if (diagnose) { cat("\nRemoving", nOrig-nrow(dat), "records due to NA\n") }
    
    # Create the base plot
    p <- dat %>%
        ggplot(aes_string(x=fctVar, y=numVar)) + 
        geom_boxplot(fill="lightblue") + 
        labs(title=title, 
             subtitle=subT, 
             x=ifelse(showXLabel, paste0(mapper[fctVar], " - ", fctVar), ""), 
             y=paste0(mapper[numVar], " - ", numVar)
             )
    
    # If facetting has been requested, facet by the desired variable
    if (!is.null(facetOn)) {
        p <- p + facet_wrap(as.formula(paste("~", facetOn)))
    }
    
    # If showCentral=TRUE, add a dot plot for the overall average
    if (showCentral) {
        centData <- helperFactorNumeric(dat, .f=median, byVar=fctVar, numVar=numVar)
        p <- p + geom_point(data=centData, aes(y=helpFN), size=2, color="red")
    }

    # Render the final plot
    print(p)
    
}

# Example for Detroit 2016 - using TempF and month
plotFactorNumeric(kdtw_2016, 
                  fctVar="month", 
                  numVar="TempF", 
                  subT="Detroit, MI (2016)", 
                  showXLabel=FALSE,
                  diagnose=TRUE
                  )
## 
## Removing 49 records due to NA

# Example for Detroit 2016 - using WindSpeed and wType
plotFactorNumeric(kdtw_2016, 
                  fctVar="wType", 
                  numVar="WindSpeed", 
                  subT="Detroit, MI (2016)", 
                  showXLabel=TRUE,
                  diagnose=TRUE
                  )
## 
## Removing 49 records due to NA

# Example for Detroit 2016 - using Visibility and wType
plotFactorNumeric(kdtw_2016, 
                  fctVar="wType", 
                  numVar="Visibility", 
                  subT="Detroit, MI (2016)", 
                  showXLabel=TRUE,
                  diagnose=TRUE
                  )
## 
## Removing 49 records due to NA

An issue previous observed where visibility 1/16SM was interpreted as 16 statutory miles has been corrected in the ‘WeatherDownloads_202005_v002’ file.

The function corMETAR() calculates correlations among numeric variables in the METAR data:

# Function to calculate, display, and plot variable correlations
corMETAR <- function(met, numVars, subT="") {

    # Keep only complete cases and report on data kept
    dfUse <- met %>%
        select_at(vars(all_of(numVars))) %>%
        filter(complete.cases(.))
    
    nU <- nrow(dfUse)
    nM <- nrow(met)
    myPct <- round(100*nU/nM, 1)
    cat("\n *** Correlations use ", nU, " complete cases (", myPct, "% of ", nM, " total) ***\n", sep="")
    
    # Create the correlation matrix
    mtxCorr <- dfUse %>%
        cor()

    # Print the correlations
    mtxCorr %>%
        round(2) %>%
        print()

    # Display a heat map
    corrplot::corrplot(mtxCorr, 
                       method="color", 
                       title=paste0("Hourly Weather Correlations\n", subT), 
                       mar=c(0, 0, 2, 0)
                       )
}

# Example for Detroit, MI 2016
coreNum <- c("TempC", "TempF", "DewC", "DewF", 
             "Altimeter", "modSLP", "WindSpeed", "Visibility"
             )
corMETAR(kdtw_2016, numVars=coreNum, subT="Detroit, MI (2016) METAR")
## 
##  *** Correlations use 8769 complete cases (99.4% of 8818 total) ***
##            TempC TempF  DewC  DewF Altimeter modSLP WindSpeed Visibility
## TempC       1.00  1.00  0.92  0.92     -0.17  -0.24     -0.10       0.14
## TempF       1.00  1.00  0.92  0.92     -0.17  -0.24     -0.10       0.14
## DewC        0.92  0.92  1.00  1.00     -0.22  -0.28     -0.18      -0.07
## DewF        0.92  0.92  1.00  1.00     -0.22  -0.28     -0.18      -0.07
## Altimeter  -0.17 -0.17 -0.22 -0.22      1.00   1.00     -0.37       0.15
## modSLP     -0.24 -0.24 -0.28 -0.28      1.00   1.00     -0.35       0.14
## WindSpeed  -0.10 -0.10 -0.18 -0.18     -0.37  -0.35      1.00       0.10
## Visibility  0.14  0.14 -0.07 -0.07      0.15   0.14      0.10       1.00

The function lmMETAR() runs simple linear regression models on the METAR data:

# Function for linear regressions on METAR data
lmMETAR <- function(met, 
                    y, 
                    x, 
                    yName, 
                    subT=""
                    ) {
    
    # Convert to formula
    myChar <- paste0(y, " ~ ", x)
    cat("\n *** Regression call is:", myChar, "***\n")
    
    # Run regression
    regr <- lm(formula(myChar), data=met)
    
    # Summarize regression
    print(summary(regr))
    
    # Predict the new values
    pred <- predict(regr, newdata=met)
    
    # Plot the predictions
    p <- met %>%
        select_at(vars(all_of(y))) %>%
        mutate(pred=pred) %>%
        filter_all(all_vars(!is.na(.))) %>%
        group_by_at(vars(all_of(c(y, "pred")))) %>%
        summarize(n=n()) %>%
        ggplot(aes_string(x=y, y="pred")) + 
        geom_point(aes(size=n), alpha=0.25) + 
        geom_smooth(aes(weight=n), method="lm") + 
        labs(title=paste0("Predicted vs. Actual ", yName, " - ", x, " as Predictor"), 
             subtitle=subT, 
             x=paste0("Actual ", yName), 
             y=paste0("Predicted ", yName)
             )
    print(p)
    
}

# Examples for Detroit, MI 2016
lmMETAR(kdtw_2016, "modSLP", "Altimeter", yName="Sea Level Pressure", subT="Detroit, MI (2016)")
## 
##  *** Regression call is: modSLP ~ Altimeter ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.96361 -0.44022 -0.03758  0.40772  1.41448 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -16.44389    0.74723  -22.01   <2e-16 ***
## Altimeter    34.41514    0.02488 1383.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4961 on 8767 degrees of freedom
##   (49 observations deleted due to missingness)
## Multiple R-squared:  0.9954, Adjusted R-squared:  0.9954 
## F-statistic: 1.913e+06 on 1 and 8767 DF,  p-value: < 2.2e-16

lmMETAR(kdtw_2016, "modSLP", "Altimeter + TempF", yName="Sea Level Pressure", subT="Detroit, MI (2016)")
## 
##  *** Regression call is: modSLP ~ Altimeter + TempF ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57802 -0.12674  0.00481  0.12820  0.65708 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.713e+00  2.803e-01  -13.25   <2e-16 ***
## Altimeter    3.403e+01  9.302e-03 3658.80   <2e-16 ***
## TempF       -2.351e-02  9.941e-05 -236.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1826 on 8766 degrees of freedom
##   (49 observations deleted due to missingness)
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9994 
## F-statistic: 7.086e+06 on 2 and 8766 DF,  p-value: < 2.2e-16

lmMETAR(kdtw_2016, "TempC", "DewC", yName="Temperature (C)", subT="Detroit, MI (2016)")
## 
##  *** Regression call is: TempC ~ DewC ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.174 -3.172 -1.165  2.822 17.828 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.17084    0.05373   114.9   <2e-16 ***
## DewC         0.99909    0.00470   212.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.463 on 8767 degrees of freedom
##   (49 observations deleted due to missingness)
## Multiple R-squared:  0.8375, Adjusted R-squared:  0.8375 
## F-statistic: 4.519e+04 on 1 and 8767 DF,  p-value: < 2.2e-16

The basicWindPlots() function creates plots for wind speed and direction:

# Generate basic wind plots
basicWindPlots <- function(met, 
                           dirVar="WindDir", 
                           spdVar="WindSpeed",
                           desc="", 
                           gran="", 
                           mapper=varMapper
                           ) {

    # Plot for the wind direction
    wDir <- met %>%
        ggplot(aes_string(x=dirVar)) + 
        geom_bar() + 
        labs(title=paste0(desc, " Wind Direction"), subtitle=gran, 
             y="# Hourly Observations", x=mapper[dirVar]
             ) + 
        theme(axis.text.x=element_text(angle=90))
    print(wDir)

    # Plot for the minimum, average, and maximum wind speed by wind direction
    # Wind direction 000 is reserved for 0 KT wind, while VRB is reserved for 3-6 KT variable winds
    wSpeedByDir <- met %>%
        filter(!is.na(get(dirVar))) %>%
        group_by_at(vars(all_of(dirVar))) %>%
        summarize(minWind=min(get(spdVar)), meanWind=mean(get(spdVar)), maxWind=max(get(spdVar))) %>%
        ggplot(aes_string(x=dirVar)) +
        geom_point(aes(y=meanWind), color="red", size=2) +
        geom_errorbar(aes(ymin=minWind, ymax=maxWind)) +
        labs(title=paste0(desc, " Wind Speed (Max, Mean, Min) By Wind Direction"), 
             subtitle=gran,
             y=mapper[spdVar], 
             x=mapper[dirVar]
             ) + 
        theme(axis.text.x=element_text(angle=90))
    print(wSpeedByDir)

    # Plot for the wind speed
    pctZero <- sum(pull(met, spdVar)==0, na.rm=TRUE) / nrow(met)
    wSpeed <- met %>%
        filter_at(vars(all_of(spdVar)), all_vars(!is.na(.))) %>%
        ggplot(aes_string(x=spdVar)) +
        geom_bar(aes(y=..count../sum(..count..))) +
        labs(title=paste0(round(100*pctZero), "% of wind speeds in ", desc, " measure 0 Knots"),
             subtitle=gran,
             y="% Hourly Observations", 
             x=mapper[spdVar]
             )
    print(wSpeed)
    
    # Polar plot for wind speed and wind direction
    wData <- met %>%
        filter_at(vars(all_of(dirVar)), all_vars(!is.na(.) & !(. %in% c("000", "VRB")))) %>%
        filter_at(vars(all_of(spdVar)), all_vars(!is.na(.))) %>%
        mutate_at(vars(all_of(dirVar)), as.numeric) %>%
        group_by_at(vars(all_of(c(dirVar, spdVar)))) %>%
        summarize(n=n())
        
    wPolarDirSpeed <- wData %>%
        ggplot(aes_string(x=spdVar, y=dirVar)) +
        geom_point(alpha=0.1, aes(size=n)) +
        coord_polar(theta="y") +
        labs(title=paste0(desc, " Direction vs. Wind Speed"), 
             subtitle=gran, 
             x=mapper[spdVar], 
             y=mapper[dirVar]
             ) +
        scale_y_continuous(limits=c(0, 360), breaks=c(0, 90, 180, 270, 360)) +
        scale_x_continuous(limits=c(0, 40), breaks=c(0, 5, 10, 15, 20, 25, 30, 35, 40)) +
        geom_point(aes(x=0, y=0), color="red", size=2)
    print(wPolarDirSpeed)

}

# Example for Detroit, MI 2016
basicWindPlots(kdtw_2016, desc="Detroit, MI (2016)", gran="KDTW METAR")

The getWindDirGroup() function maps wind direction to a category such as NNE. Because the METAR data are recorded in units of 10 degrees, either 4 groupings (90 degrees each) or 12 groupings (30 degrees each) are preferred, so that each category has the same underlying number of buckets:

# Extract the wind direction data from a processed METAR file
getWindDirGroup <- function(met, src) {
    
    # Use the fullMETAR data and extract WindDir, WindSpeed, month
    windPlotData <- met %>%
        select(WindDir, WindSpeed, month) %>%
        mutate(windDirGroup=factor(case_when(WindSpeed==0 ~ "No Wind", 
                                             WindDir=="VRB" ~ "Variable", 
                                             WindDir %in% c("350", "360", "010") ~ "N", 
                                             WindDir %in% c("020", "030", "040") ~ "NNE", 
                                             WindDir %in% c("050", "060", "070") ~ "ENE", 
                                             WindDir %in% c("080", "090", "100") ~ "E", 
                                             WindDir %in% c("110", "120", "130") ~ "ESE",
                                             WindDir %in% c("140", "150", "160") ~ "SSE", 
                                             WindDir %in% c("170", "180", "190") ~ "S", 
                                             WindDir %in% c("200", "210", "220") ~ "SSW",
                                             WindDir %in% c("230", "240", "250") ~ "WSW", 
                                             WindDir %in% c("260", "270", "280") ~ "W", 
                                             WindDir %in% c("290", "300", "310") ~ "WNW", 
                                             WindDir %in% c("320", "330", "340") ~ "NNW", 
                                             TRUE ~ "Error"
                                             ), 
                                   levels=c("No Wind", "Variable", "Error", 
                                            "N", "NNE", "ENE", 
                                            "E", "ESE", "SSE", 
                                            "S", "SSW", "WSW", 
                                            "W", "WNW", "NNW"
                                            )
                                   )
               )
    
    # Rempve the errors and calculate percentages by month for the remainder
    processedWindData <- windPlotData %>%
        filter(windDirGroup != "Error") %>%
        group_by(month, windDirGroup) %>%
        summarize(n=n()) %>%
        ungroup() %>%
        group_by(month) %>%
        mutate(pct=n/sum(n)) %>%
        ungroup() %>%
        mutate(src=src)
    
    processedWindData

}

The function conslidatePlotWind() then calls getWindDirGroup() for any number of files:

# Consolidate and plot wind data
consolidatePlotWind <- function(files, names) {

    consFun <- function(x, y) { getWindDirGroup(met=x, src=y) }
    boundByRows <- map2_dfr(.x=files, .y=names, .f=consFun)

    # Show frequency by month for each city, faceted by wind direction
    p1 <- boundByRows %>%
        ggplot(aes(x=month, y=pct, color=src)) + 
        geom_line(aes(group=src)) + 
        facet_wrap(~windDirGroup) + 
        labs(title="Wind Frequency by Month", 
             x="Month", 
             y="Frequency of Wind Observations"
             ) +
        theme(axis.text.x=element_text(angle=90))
    print(p1)
    
    # Show frequency by wind direction for each city, faceted by month
    p2 <- boundByRows %>%
        ggplot(aes(x=windDirGroup, y=pct, color=src)) + 
        geom_line(aes(group=src)) + 
        facet_wrap(~month) + 
        labs(title="Wind Frequency by Wind Direction", 
             x="Wind Direction", 
             y="Frequency of Wind Observations"
             ) +
        theme(axis.text.x=element_text(angle=90))
    print(p2)
    
    boundByRows
    
}

# Load the Las Vegas data and New Orleans data for comparison
kmsy_2016 <- readRDS("./RInputFiles/ProcessedMETAR/metar_kmsy_2016.rds")
klas_2016 <- readRDS("./RInputFiles/ProcessedMETAR/metar_klas_2016.rds")

# Run wind by month comparisons for Detroit, Las Vegas, New Orleans
consolidatePlotWind(files=list(kdtw_2016, klas_2016, kmsy_2016), 
                    names=c("Detroit, MI (2016)", "Las Vegas, NV (2016)", "New Orleans, LA (2016)")
                    )

## # A tibble: 504 x 5
##    month windDirGroup     n     pct src               
##    <fct> <fct>        <int>   <dbl> <chr>             
##  1 Jan   No Wind         46 0.0600  Detroit, MI (2016)
##  2 Jan   Variable         3 0.00391 Detroit, MI (2016)
##  3 Jan   N               26 0.0339  Detroit, MI (2016)
##  4 Jan   NNE             26 0.0339  Detroit, MI (2016)
##  5 Jan   ENE              9 0.0117  Detroit, MI (2016)
##  6 Jan   E               16 0.0209  Detroit, MI (2016)
##  7 Jan   ESE             14 0.0183  Detroit, MI (2016)
##  8 Jan   SSE             72 0.0939  Detroit, MI (2016)
##  9 Jan   S              131 0.171   Detroit, MI (2016)
## 10 Jan   SSW            158 0.206   Detroit, MI (2016)
## # ... with 494 more rows

Combining Functions

The functions can be combined so that a full process can be run for a given file:

# File name to city name mapper
cityNameMapper <- c(kdtw_2016="Detroit, MI (2016)", 
                    kewr_2016="Newark, NJ (2016)",
                    kgrb_2016="Green Bay, WI (2016)",
                    kgrr_2016="Grand Rapids, MI (2016)",
                    kiah_2016="Houston, TX (2016)",
                    kind_2016="Indianapolis, IN (2016)",
                    klas_2015="Las Vegas, NV (2015)",
                    klas_2016="Las Vegas, NV (2016)", 
                    klas_2017="Las Vegas, NV (2017)", 
                    klnk_2016="Lincoln, NE (2016)",
                    kmke_2016="Milwaukee, WI (2016)",
                    kmsn_2016="Madison, WI (2016)",
                    kmsp_2016="Minneapolis, MN (2016)",
                    kmsy_2015="New Orleans, LA (2015)",
                    kmsy_2016="New Orleans, LA (2016)", 
                    kmsy_2017="New Orleans, LA (2017)", 
                    kord_2015="Chicago, IL (2015)",
                    kord_2016="Chicago, IL (2016)", 
                    kord_2017="Chicago, IL (2017)", 
                    ksan_2015="San Diego, CA (2015)",
                    ksan_2016="San Diego, CA (2016)",
                    ksan_2017="San Diego, CA (2017)",
                    ktvc_2016="Traverse City, MI (2016)"
                    )

# This is a helper function to create a locale description
getLocaleDescription <- function(x, mapper=cityNameMapper) {
    
    # Initialize the description as NULL
    desc <- NULL
    
    for (potMatch in names(mapper)) {
        if (str_detect(string=x, pattern=potMatch)) {
            desc <- mapper[potMatch]
            break
        }
    }
    
    # If the mapping failed, use UNMAPPED_x as the description
    if (is.null(desc)) {
        desc <- paste0("UNMAPPED_", x)
        cat("\nUnable to find a description, will use ", desc, "\n\n", sep="")
    } else {
        cat("\nWill use ", desc, " as the description for ", x, "\n\n", sep="")
    }
    
    # Return the descriptive name
    desc
    
}

# The following function runs the functions that work on a single data source
combinedEDA <- function(filename=NULL, 
                        tbl=NULL,
                        desc=NULL,
                        mets=c("WindDir", "WindSpeed", "TempC", "DewC", "Altimeter", 
                               "modSLP", "cType1", "cLevel1", "month", "day"
                               ),
                        corPairs=list(c("TempC", "TempF"), 
                                      c("TempC", "DewC"), 
                                      c("Altimeter", "modSLP"), 
                                      c("Altimeter", "WindSpeed")
                                      ),
                        fctPairs=list(c("month", "TempF"), 
                                      c("month", "DewF"), 
                                      c("month", "WindSpeed"), 
                                      c("month", "Altimeter"), 
                                      c("wType", "Visibility"), 
                                      c("wType", "WindSpeed"), 
                                      c("WindDir", "WindSpeed"), 
                                      c("WindDir", "TempF")
                                      ),
                        heatVars=c("TempC", "TempF", 
                                   "DewC", "DewF", 
                                   "Altimeter", "modSLP", 
                                   "WindSpeed", "Visibility", 
                                   "monthint", "day"
                                   ),
                        lmVars=list(c("modSLP", "Altimeter"), 
                                    c("modSLP", "Altimeter + TempF"), 
                                    c("TempF", "DewF"), 
                                    c("WindSpeed", "Altimeter + TempF + DewF + month")
                                    ),
                        mapVariables=varMapper,
                        mapFileNames=cityNameMapper,
                        path="./RInputFiles/ProcessedMETAR/"
                        ) {
    
    # Require that either filename OR tbl be passed
    if (is.null(filename) & is.null(tbl)) {
        cat("\nMust provide either a filename or an already-loaded tibble\n")
        stop("\nfilename=NULL and tbl=NULL may not both be passed to combinedEDA()\n")
    }
    
    # Require that either 1) filename and mapFileNames, OR 2) desc be passed
    if ((is.null(filename) | is.null(mapFileNames)) & is.null(desc)) {
        cat("\nMust provide either a filename with mapFileNames or a file description\n")
        stop("\nWhen desc=NULL must have non-null entries for both filename= and mapFileNames=\n")
    }
    
    # Find the description if it is NULL (default)
    if (is.null(desc)) {
        desc <- getLocaleDescription(filename, mapper=mapFileNames)
    }
    
    # Warn if both filename and tbl are passed, since tbl will be used
    if (!is.null(filename) & !is.null(tbl)) {
        cat("\nA tibble has been passed and will be used as the dataset for this function\n")
        warning("\nArgument filename=", filename, " is NOT loaded since a tibble was passed\n")
    }
    
    # Read in the file unless tbl has already been passed to the routine
    if (is.null(tbl)) {
        tbl <- readRDS(paste0(path, filename))
    }
    
    # Plot counts by metric
    plotcountsByMetric(tbl, mets=mets, title=desc, diagnose=TRUE)
    
    # Plot relationships between two variables
    for (ys in corPairs) {
        plotNumCor(tbl, var1=ys[1], var2=ys[2], subT=desc, diagnose=TRUE)
    }
    
    # plot numeric vs. factor
    for (ys in fctPairs) {
        plotFactorNumeric(tbl, fctVar=ys[1], numVar=ys[2], subT=desc, showXLabel=FALSE, diagnose=TRUE)
    }
    
    # Heatmap for variable correlations
    corMETAR(tbl, numVars=heatVars, subT=paste0(desc, " METAR"))

    # Run linear rergression
    for (ys in lmVars) {
        lmMETAR(tbl, y=ys[1], x=ys[2], yName=varMapper[ys[1]], subT=desc)
    }
    
    # Run the basic wind plots
    basicWindPlots(tbl, desc=desc, gran="")
    
    # Return the tibble
    tbl
    
}

The process can then be run for each of Detroit (2016), Las Vegas (2016), and New Orleans (2016):

# Retrieve the Detroit, MI (2016) data
kdtw_2016 <- combinedEDA("metar_kdtw_2016.rds")
## 
## Will use Detroit, MI (2016) as the description for metar_kdtw_2016.rds
## 
## 
## Dropping 1 rows with 49 observations due to NA

## 
## Dropping 1 rows with 49 observations due to NA

## 
## Dropping 1 rows with 49 observations due to NA

## 
## Dropping 1 rows with 49 observations due to NA

## 
## Dropping 1 rows with 49 observations due to NA

## 
## Dropping 1 rows with 49 observations due to NA

## 
## Dropping 1 rows with 651 observations due to NA

## 
## Dropping 1 rows with 49 observations due to NA

## 
## Dropping 1 rows with 49 observations due to NA

## 
## Dropping 1 rows with 49 observations due to NA

## 
## Dropping 1 rows with 49 observations due to NA

## 
## Removing 49 records due to NA

## 
## Removing 49 records due to NA

## 
## Removing 49 records due to NA

## 
## Removing 49 records due to NA

## 
## Removing 49 records due to NA

## 
## Removing 49 records due to NA

## 
## Removing 49 records due to NA

## 
## Removing 49 records due to NA

## 
##  *** Correlations use 8769 complete cases (99.4% of 8818 total) ***
##            TempC TempF  DewC  DewF Altimeter modSLP WindSpeed Visibility
## TempC       1.00  1.00  0.92  0.92     -0.17  -0.24     -0.10       0.14
## TempF       1.00  1.00  0.92  0.92     -0.17  -0.24     -0.10       0.14
## DewC        0.92  0.92  1.00  1.00     -0.22  -0.28     -0.18      -0.07
## DewF        0.92  0.92  1.00  1.00     -0.22  -0.28     -0.18      -0.07
## Altimeter  -0.17 -0.17 -0.22 -0.22      1.00   1.00     -0.37       0.15
## modSLP     -0.24 -0.24 -0.28 -0.28      1.00   1.00     -0.35       0.14
## WindSpeed  -0.10 -0.10 -0.18 -0.18     -0.37  -0.35      1.00       0.10
## Visibility  0.14  0.14 -0.07 -0.07      0.15   0.14      0.10       1.00
## monthint    0.24  0.24  0.32  0.32      0.19   0.17     -0.06      -0.11
## day         0.04  0.04  0.03  0.03     -0.02  -0.02      0.06       0.04
##            monthint   day
## TempC          0.24  0.04
## TempF          0.24  0.04
## DewC           0.32  0.03
## DewF           0.32  0.03
## Altimeter      0.19 -0.02
## modSLP         0.17 -0.02
## WindSpeed     -0.06  0.06
## Visibility    -0.11  0.04
## monthint       1.00  0.02
## day            0.02  1.00

## 
##  *** Regression call is: modSLP ~ Altimeter ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.96361 -0.44022 -0.03758  0.40772  1.41448 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -16.44389    0.74723  -22.01   <2e-16 ***
## Altimeter    34.41514    0.02488 1383.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4961 on 8767 degrees of freedom
##   (49 observations deleted due to missingness)
## Multiple R-squared:  0.9954, Adjusted R-squared:  0.9954 
## F-statistic: 1.913e+06 on 1 and 8767 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: modSLP ~ Altimeter + TempF ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57802 -0.12674  0.00481  0.12820  0.65708 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.713e+00  2.803e-01  -13.25   <2e-16 ***
## Altimeter    3.403e+01  9.302e-03 3658.80   <2e-16 ***
## TempF       -2.351e-02  9.941e-05 -236.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1826 on 8766 degrees of freedom
##   (49 observations deleted due to missingness)
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9994 
## F-statistic: 7.086e+06 on 2 and 8766 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: TempF ~ DewF ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.132  -6.065  -2.083   4.042  31.029 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.060663   0.211976   52.18   <2e-16 ***
## DewF         1.000977   0.004677  214.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.987 on 8767 degrees of freedom
##   (49 observations deleted due to missingness)
## Multiple R-squared:  0.8393, Adjusted R-squared:  0.8393 
## F-statistic: 4.58e+04 on 1 and 8767 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: WindSpeed ~ Altimeter + TempF + DewF + month ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.3544  -2.5454  -0.1494   2.3669  20.6042 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 306.827720   6.726512  45.615  < 2e-16 ***
## Altimeter    -9.980907   0.222840 -44.790  < 2e-16 ***
## TempF         0.206384   0.006044  34.145  < 2e-16 ***
## DewF         -0.217533   0.006755 -32.204  < 2e-16 ***
## monthFeb      0.188758   0.204057   0.925   0.3550    
## monthMar     -1.375926   0.211244  -6.513 7.75e-11 ***
## monthApr     -2.406187   0.217870 -11.044  < 2e-16 ***
## monthMay     -3.616016   0.248432 -14.555  < 2e-16 ***
## monthJun     -4.337974   0.277532 -15.631  < 2e-16 ***
## monthJul     -3.428235   0.300872 -11.394  < 2e-16 ***
## monthAug     -2.971582   0.312087  -9.522  < 2e-16 ***
## monthSep     -1.839206   0.292614  -6.285 3.43e-10 ***
## monthOct     -0.477868   0.255640  -1.869   0.0616 .  
## monthNov     -0.013008   0.226830  -0.057   0.9543    
## monthDec      2.071836   0.200848  10.315  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.883 on 8754 degrees of freedom
##   (49 observations deleted due to missingness)
## Multiple R-squared:  0.3218, Adjusted R-squared:  0.3207 
## F-statistic: 296.7 on 14 and 8754 DF,  p-value: < 2.2e-16

# Retrieve the Las Vegas, NV (2016) data
klas_2016 <- combinedEDA("metar_klas_2016.rds")
## 
## Will use Las Vegas, NV (2016) as the description for metar_klas_2016.rds
## 
## 
## Dropping 1 rows with 35 observations due to NA

## 
## Dropping 1 rows with 35 observations due to NA

## 
## Dropping 1 rows with 35 observations due to NA

## 
## Dropping 1 rows with 35 observations due to NA

## 
## Dropping 1 rows with 35 observations due to NA

## 
## Dropping 1 rows with 35 observations due to NA

## 
## Dropping 1 rows with 2440 observations due to NA

## 
## Dropping 1 rows with 35 observations due to NA

## 
## Dropping 1 rows with 35 observations due to NA

## 
## Dropping 1 rows with 35 observations due to NA

## 
## Dropping 1 rows with 35 observations due to NA

## 
## Removing 35 records due to NA

## 
## Removing 35 records due to NA

## 
## Removing 35 records due to NA

## 
## Removing 35 records due to NA

## 
## Removing 35 records due to NA

## 
## Removing 35 records due to NA

## 
## Removing 35 records due to NA

## 
## Removing 35 records due to NA

## 
##  *** Correlations use 8783 complete cases (99.6% of 8818 total) ***
##            TempC TempF  DewC  DewF Altimeter modSLP WindSpeed Visibility
## TempC       1.00  1.00  0.22  0.22     -0.51  -0.62      0.22       0.01
## TempF       1.00  1.00  0.22  0.22     -0.51  -0.62      0.22       0.01
## DewC        0.22  0.22  1.00  1.00     -0.24  -0.27     -0.04      -0.13
## DewF        0.22  0.22  1.00  1.00     -0.24  -0.27     -0.04      -0.13
## Altimeter  -0.51 -0.51 -0.24 -0.24      1.00   0.99     -0.38       0.06
## modSLP     -0.62 -0.62 -0.27 -0.27      0.99   1.00     -0.38       0.06
## WindSpeed   0.22  0.22 -0.04 -0.04     -0.38  -0.38      1.00      -0.02
## Visibility  0.01  0.01 -0.13 -0.13      0.06   0.06     -0.02       1.00
## monthint    0.14  0.14  0.06  0.06      0.06   0.03     -0.01      -0.02
## day        -0.01 -0.01  0.03  0.03     -0.01  -0.01      0.00      -0.07
##            monthint   day
## TempC          0.14 -0.01
## TempF          0.14 -0.01
## DewC           0.06  0.03
## DewF           0.06  0.03
## Altimeter      0.06 -0.01
## modSLP         0.03 -0.01
## WindSpeed     -0.01  0.00
## Visibility    -0.02 -0.07
## monthint       1.00  0.02
## day            0.02  1.00

## 
##  *** Regression call is: modSLP ~ Altimeter ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1381 -0.7848 -0.0607  0.6891  3.6362 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -116.30984    1.72975  -67.24   <2e-16 ***
## Altimeter     37.68826    0.05776  652.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9753 on 8781 degrees of freedom
##   (35 observations deleted due to missingness)
## Multiple R-squared:  0.9798, Adjusted R-squared:  0.9798 
## F-statistic: 4.258e+05 on 1 and 8781 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: modSLP ~ Altimeter + TempF ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.18161 -0.33113  0.02395  0.33395  1.08509 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.537e+01  8.693e-01  -29.18   <2e-16 ***
## Altimeter    3.478e+01  2.868e-02 1212.89   <2e-16 ***
## TempF       -5.581e-02  2.813e-04 -198.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4165 on 8780 degrees of freedom
##   (35 observations deleted due to missingness)
## Multiple R-squared:  0.9963, Adjusted R-squared:  0.9963 
## F-statistic: 1.187e+06 on 2 and 8780 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: TempF ~ DewF ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.026 -14.381  -0.341  13.299  46.312 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 62.12523    0.49251  126.14   <2e-16 ***
## DewF         0.32810    0.01576   20.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.94 on 8781 degrees of freedom
##   (35 observations deleted due to missingness)
## Multiple R-squared:  0.04705,    Adjusted R-squared:  0.04694 
## F-statistic: 433.5 on 1 and 8781 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: WindSpeed ~ Altimeter + TempF + DewF + month ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1199  -2.7665  -0.6134   2.0975  22.2630 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 316.166903   9.583584  32.990  < 2e-16 ***
## Altimeter   -10.363527   0.316290 -32.766  < 2e-16 ***
## TempF         0.012196   0.005239   2.328 0.019933 *  
## DewF         -0.053040   0.004107 -12.915  < 2e-16 ***
## monthFeb      1.477528   0.225729   6.546 6.26e-11 ***
## monthMar      0.848312   0.232319   3.652 0.000262 ***
## monthApr      1.590035   0.242532   6.556 5.84e-11 ***
## monthMay      0.327608   0.261841   1.251 0.210905    
## monthJun      0.141204   0.319290   0.442 0.658325    
## monthJul      1.092983   0.328362   3.329 0.000876 ***
## monthAug      0.545515   0.319165   1.709 0.087450 .  
## monthSep      0.647423   0.281377   2.301 0.021420 *  
## monthOct      1.147640   0.253286   4.531 5.95e-06 ***
## monthNov      1.097005   0.227198   4.828 1.40e-06 ***
## monthDec      0.672432   0.214810   3.130 0.001752 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.181 on 8768 degrees of freedom
##   (35 observations deleted due to missingness)
## Multiple R-squared:  0.1774, Adjusted R-squared:  0.176 
## F-statistic:   135 on 14 and 8768 DF,  p-value: < 2.2e-16

# Retrieve the New Orleans, LA (2016) data
kmsy_2016 <- combinedEDA("metar_kmsy_2016.rds")
## 
## Will use New Orleans, LA (2016) as the description for metar_kmsy_2016.rds
## 
## 
## Dropping 1 rows with 14 observations due to NA

## 
## Dropping 1 rows with 14 observations due to NA

## 
## Dropping 1 rows with 14 observations due to NA

## 
## Dropping 1 rows with 14 observations due to NA

## 
## Dropping 1 rows with 14 observations due to NA

## 
## Dropping 1 rows with 14 observations due to NA

## 
## Dropping 1 rows with 1029 observations due to NA

## 
## Dropping 1 rows with 14 observations due to NA

## 
## Dropping 1 rows with 14 observations due to NA

## 
## Dropping 1 rows with 14 observations due to NA

## 
## Dropping 1 rows with 14 observations due to NA

## 
## Removing 14 records due to NA

## 
## Removing 14 records due to NA

## 
## Removing 14 records due to NA

## 
## Removing 14 records due to NA

## 
## Removing 14 records due to NA

## 
## Removing 14 records due to NA

## 
## Removing 14 records due to NA

## 
## Removing 14 records due to NA

## 
##  *** Correlations use 8799 complete cases (99.8% of 8813 total) ***
##            TempC TempF  DewC  DewF Altimeter modSLP WindSpeed Visibility
## TempC       1.00  1.00  0.85  0.85     -0.48  -0.47     -0.09       0.13
## TempF       1.00  1.00  0.85  0.85     -0.48  -0.48     -0.09       0.13
## DewC        0.85  0.85  1.00  1.00     -0.56  -0.56     -0.21      -0.06
## DewF        0.85  0.85  1.00  1.00     -0.57  -0.56     -0.21      -0.05
## Altimeter  -0.48 -0.48 -0.56 -0.57      1.00   1.00     -0.08       0.08
## modSLP     -0.47 -0.48 -0.56 -0.56      1.00   1.00     -0.08       0.08
## WindSpeed  -0.09 -0.09 -0.21 -0.21     -0.08  -0.08      1.00       0.03
## Visibility  0.13  0.13 -0.06 -0.05      0.08   0.08      0.03       1.00
## monthint    0.26  0.26  0.31  0.31      0.09   0.09     -0.22      -0.02
## day         0.03  0.03  0.07  0.07      0.02   0.02     -0.05       0.01
##            monthint   day
## TempC          0.26  0.03
## TempF          0.26  0.03
## DewC           0.31  0.07
## DewF           0.31  0.07
## Altimeter      0.09  0.02
## modSLP         0.09  0.02
## WindSpeed     -0.22 -0.05
## Visibility    -0.02  0.01
## monthint       1.00  0.02
## day            0.02  1.00

## 
##  *** Regression call is: modSLP ~ Altimeter ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.232515 -0.083700  0.000544  0.084787  0.227320 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.773456   0.230067   12.05   <2e-16 ***
## Altimeter   33.779607   0.007654 4413.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1015 on 8797 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9995 
## F-statistic: 1.948e+07 on 1 and 8797 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: modSLP ~ Altimeter + TempF ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.222066 -0.082359 -0.001036  0.083533  0.217460 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 1.557e+00  2.638e-01    5.903  3.7e-09 ***
## Altimeter   3.382e+01  8.667e-03 3902.029  < 2e-16 ***
## TempF       8.751e-04  9.432e-05    9.277  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.101 on 8796 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.9996, Adjusted R-squared:  0.9996 
## F-statistic: 9.833e+06 on 2 and 8796 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: TempF ~ DewF ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.051  -4.777  -1.265   4.199  26.477 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.440028   0.319858   76.41   <2e-16 ***
## DewF         0.778877   0.005065  153.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.763 on 8797 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.7288, Adjusted R-squared:  0.7288 
## F-statistic: 2.364e+04 on 1 and 8797 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: WindSpeed ~ Altimeter + TempF + DewF + month ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9452  -2.5710  -0.2247   2.2592  18.9024 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 190.986029  11.812622  16.168  < 2e-16 ***
## Altimeter    -6.223347   0.387357 -16.066  < 2e-16 ***
## TempF         0.225378   0.007290  30.915  < 2e-16 ***
## DewF         -0.169883   0.006513 -26.082  < 2e-16 ***
## monthFeb     -0.787460   0.204817  -3.845 0.000122 ***
## monthMar     -0.270554   0.215758  -1.254 0.209885    
## monthApr     -1.941174   0.225642  -8.603  < 2e-16 ***
## monthMay     -3.763407   0.241926 -15.556  < 2e-16 ***
## monthJun     -5.215285   0.269852 -19.326  < 2e-16 ***
## monthJul     -5.568010   0.283515 -19.639  < 2e-16 ***
## monthAug     -4.834769   0.277122 -17.446  < 2e-16 ***
## monthSep     -5.769099   0.272789 -21.149  < 2e-16 ***
## monthOct     -4.812814   0.245841 -19.577  < 2e-16 ***
## monthNov     -2.827680   0.220138 -12.845  < 2e-16 ***
## monthDec     -0.203742   0.210552  -0.968 0.333241    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.847 on 8784 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.2278, Adjusted R-squared:  0.2266 
## F-statistic: 185.1 on 14 and 8784 DF,  p-value: < 2.2e-16

Directing Output to a Separate File

There is a lot of good information in the EDA, but it can be overwhelming to have everything in one place. Perhaps a wrapper function can be built allowing for outputs to be piped to a given file:

wrapCombinedEDA <- function(readFile, 
                            readPath="./RInputFiles/ProcessedMETAR/", 
                            mapFileNames=cityNameMapper,
                            desc=NULL,
                            writeLogFile=NULL,
                            writeLogPDF=NULL,
                            writeLogPath=NULL,
                            appendWriteFile=FALSE,
                            ...
                            ) {
    
    # Read in the requested file
    tbl <- readRDS(paste0(readPath, readFile))

    # Find the description if it has not been passed
    if (is.null(desc)) {
        desc <- getLocaleDescription(readFile, mapper=mapFileNames)
    }
    
    # Helper function that only runs the combinedEDA() routine
    coreFunc <- function() { combinedEDA(tbl=tbl, desc=desc, mapFileNames=mapFileNames, ...) }
    
    # If writeLogPDF is not NULL, direct the graphs to a suitable PDF
    if (!is.null(writeLogPDF)) {
        
        # Prepend the provided log path if it has not been made available
        if (!is.null(writeLogPath)) {
            writeLogPDF <- paste0(writeLogPath, writeLogPDF)
        }
        
        # Provide the location of the EDA pdf file
        cat("\nEDA PDF file is available at:", writeLogPDF, "\n")

        # Redirect the writing to writeLogPDF
        pdf(writeLogPDF)
    }
    
    # Run EDA on the tbl using capture.output to redirect to a log file if specified
    if (!is.null(writeLogFile)) {
        
        # Prepend the provided log path if it has not been made available
        if (!is.null(writeLogPath)) {
            writeLogFile <- paste0(writeLogPath, writeLogFile)
        }
        
        # Provide the location of the EDA log file
        cat("\nEDA log file is available at:", writeLogFile, "\n")
        
        # Run EDA such that the output goes to the log file
        capture.output(coreFunc(), 
                       file=writeLogFile, 
                       append=appendWriteFile
                       )
        
    } else {
        # Run EDA such that output stays in stdout
        coreFunc()
    }
    
    # If writeLogPDF is not NULL, redirect to stdout
    if (!is.null(writeLogPDF)) {
        dev.off()
    }
    
    # Return the tbl
    tbl
    
}
filePath <- "./RInputFiles/ProcessedMETAR/"

# Example for the basic function for Chicago, IL (2016) written to stdout
kord_2016 <- wrapCombinedEDA("metar_kord_2016.rds", readPath=filePath)
## 
## Will use Chicago, IL (2016) as the description for metar_kord_2016.rds
## 
## 
## Dropping 1 rows with 10 observations due to NA

## 
## Dropping 1 rows with 10 observations due to NA

## 
## Dropping 1 rows with 10 observations due to NA

## 
## Dropping 1 rows with 10 observations due to NA

## 
## Dropping 1 rows with 10 observations due to NA

## 
## Dropping 1 rows with 10 observations due to NA

## 
## Dropping 1 rows with 823 observations due to NA

## 
## Dropping 1 rows with 10 observations due to NA

## 
## Dropping 1 rows with 10 observations due to NA

## 
## Dropping 1 rows with 10 observations due to NA

## 
## Dropping 1 rows with 10 observations due to NA

## 
## Removing 10 records due to NA

## 
## Removing 10 records due to NA

## 
## Removing 10 records due to NA

## 
## Removing 10 records due to NA

## 
## Removing 10 records due to NA

## 
## Removing 10 records due to NA

## 
## Removing 10 records due to NA

## 
## Removing 10 records due to NA

## 
##  *** Correlations use 8805 complete cases (99.9% of 8815 total) ***
##            TempC TempF  DewC  DewF Altimeter modSLP WindSpeed Visibility
## TempC       1.00  1.00  0.93  0.93     -0.22  -0.29     -0.12       0.19
## TempF       1.00  1.00  0.93  0.93     -0.22  -0.29     -0.12       0.19
## DewC        0.93  0.93  1.00  1.00     -0.27  -0.34     -0.19       0.05
## DewF        0.93  0.93  1.00  1.00     -0.28  -0.34     -0.19       0.05
## Altimeter  -0.22 -0.22 -0.27 -0.28      1.00   1.00     -0.31       0.19
## modSLP     -0.29 -0.29 -0.34 -0.34      1.00   1.00     -0.29       0.17
## WindSpeed  -0.12 -0.12 -0.19 -0.19     -0.31  -0.29      1.00       0.01
## Visibility  0.19  0.19  0.05  0.05      0.19   0.17      0.01       1.00
## monthint    0.24  0.24  0.27  0.27      0.18   0.16     -0.09       0.02
## day         0.05  0.05  0.05  0.05     -0.06  -0.06      0.08       0.04
##            monthint   day
## TempC          0.24  0.05
## TempF          0.24  0.05
## DewC           0.27  0.05
## DewF           0.27  0.05
## Altimeter      0.18 -0.06
## modSLP         0.16 -0.06
## WindSpeed     -0.09  0.08
## Visibility     0.02  0.04
## monthint       1.00  0.02
## day            0.02  1.00

## 
##  *** Regression call is: modSLP ~ Altimeter ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.31041 -0.47561 -0.07852  0.43888  1.66927 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -23.11596    0.85978  -26.89   <2e-16 ***
## Altimeter    34.63779    0.02864 1209.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5573 on 8803 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.994,  Adjusted R-squared:  0.994 
## F-statistic: 1.463e+06 on 1 and 8803 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: modSLP ~ Altimeter + TempF ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69460 -0.12269  0.00055  0.12142  0.73628 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.603e+00  2.864e-01  -16.07   <2e-16 ***
## Altimeter    3.407e+01  9.502e-03 3585.30   <2e-16 ***
## TempF       -2.606e-02  9.503e-05 -274.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1804 on 8802 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9994 
## F-statistic: 7.018e+06 on 2 and 8802 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: TempF ~ DewF ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.638  -5.485  -1.570   3.601  35.456 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.317877   0.188557   54.72   <2e-16 ***
## DewF         1.004504   0.004095  245.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.407 on 8803 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.8724, Adjusted R-squared:  0.8724 
## F-statistic: 6.019e+04 on 1 and 8803 DF,  p-value: < 2.2e-16

## 
##  *** Regression call is: WindSpeed ~ Altimeter + TempF + DewF + month ***
## 
## Call:
## lm(formula = formula(myChar), data = met)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.295  -2.635  -0.157   2.494  21.356 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 291.341063   7.156664  40.709  < 2e-16 ***
## Altimeter    -9.415029   0.237118 -39.706  < 2e-16 ***
## TempF         0.160498   0.006445  24.904  < 2e-16 ***
## DewF         -0.205407   0.006912 -29.718  < 2e-16 ***
## monthFeb      0.649436   0.210063   3.092 0.001997 ** 
## monthMar      0.033603   0.220448   0.152 0.878850    
## monthApr      0.092196   0.230608   0.400 0.689317    
## monthMay     -0.859361   0.258300  -3.327 0.000882 ***
## monthJun     -1.564544   0.296635  -5.274 1.36e-07 ***
## monthJul     -1.191767   0.314437  -3.790 0.000152 ***
## monthAug     -1.319054   0.319647  -4.127 3.72e-05 ***
## monthSep      0.394263   0.299972   1.314 0.188768    
## monthOct      0.511074   0.264003   1.936 0.052916 .  
## monthNov      0.050679   0.235873   0.215 0.829882    
## monthDec      1.741405   0.205243   8.485  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.984 on 8790 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.2494, Adjusted R-squared:  0.2482 
## F-statistic: 208.6 on 14 and 8790 DF,  p-value: < 2.2e-16

# Example for the basic function for San Diego, CA (2016) written to 'metar_ksan_2016_EDA.log'
ksan_2016 <- wrapCombinedEDA("metar_ksan_2016.rds", 
                             readPath=filePath, 
                             writeLogFile='metar_ksan_2016_EDA.log',
                             writeLogPDF='metar_ksan_2016_EDA.pdf', 
                             writeLogPath=filePath
                             )
## 
## Will use San Diego, CA (2016) as the description for metar_ksan_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2016_EDA.log

Finally, a function can be called to create the inputs to wrapCombinedEDA() for lower typing:

logAndPDFCombinedEDA <- function(tblName, filePath="./RInputFiles/ProcessedMETAR/") {
    
    # Create the RDS file name
    rdsName <- paste0("metar_", tblName, ".rds")
    cat("\nRDS Name:", rdsName)
    
    # Create the log file name
    logName <- paste0("metar_", tblName, "_EDA.log")
    cat("\nLog Name:", logName)
    
    # Create the PDF file name
    pdfName <- paste0("metar_", tblName, "_EDA.pdf")
    cat("\nPDF Name:", pdfName)
    
    # Call wrapCombinedEDA()
    tbl <- wrapCombinedEDA(rdsName, 
                           readPath=filePath, 
                           writeLogFile=logName, 
                           writeLogPDF=pdfName,
                           writeLogPath=filePath
                           )
    
    # Return the tbl
    tbl
    
}

This function can then be run for all of the relevant files (cached to avoid multiple runs):

# Run for 2016 only for kdtw, kewr, kgrb, kgrr, kiah, kind, klnk, kmke, kmsn, kmsp, ktvc
kdtw_2016 <- logAndPDFCombinedEDA("kdtw_2016")
## 
## RDS Name: metar_kdtw_2016.rds
## Log Name: metar_kdtw_2016_EDA.log
## PDF Name: metar_kdtw_2016_EDA.pdf
## Will use Detroit, MI (2016) as the description for metar_kdtw_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kdtw_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kdtw_2016_EDA.log
kewr_2016 <- logAndPDFCombinedEDA("kewr_2016")
## 
## RDS Name: metar_kewr_2016.rds
## Log Name: metar_kewr_2016_EDA.log
## PDF Name: metar_kewr_2016_EDA.pdf
## Will use Newark, NJ (2016) as the description for metar_kewr_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kewr_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kewr_2016_EDA.log
kgrb_2016 <- logAndPDFCombinedEDA("kgrb_2016")
## 
## RDS Name: metar_kgrb_2016.rds
## Log Name: metar_kgrb_2016_EDA.log
## PDF Name: metar_kgrb_2016_EDA.pdf
## Will use Green Bay, WI (2016) as the description for metar_kgrb_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kgrb_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kgrb_2016_EDA.log
kgrr_2016 <- logAndPDFCombinedEDA("kgrr_2016")
## 
## RDS Name: metar_kgrr_2016.rds
## Log Name: metar_kgrr_2016_EDA.log
## PDF Name: metar_kgrr_2016_EDA.pdf
## Will use Grand Rapids, MI (2016) as the description for metar_kgrr_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kgrr_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kgrr_2016_EDA.log
kiah_2016 <- logAndPDFCombinedEDA("kiah_2016")
## 
## RDS Name: metar_kiah_2016.rds
## Log Name: metar_kiah_2016_EDA.log
## PDF Name: metar_kiah_2016_EDA.pdf
## Will use Houston, TX (2016) as the description for metar_kiah_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kiah_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kiah_2016_EDA.log
kind_2016 <- logAndPDFCombinedEDA("kind_2016")
## 
## RDS Name: metar_kind_2016.rds
## Log Name: metar_kind_2016_EDA.log
## PDF Name: metar_kind_2016_EDA.pdf
## Will use Indianapolis, IN (2016) as the description for metar_kind_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kind_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kind_2016_EDA.log
klnk_2016 <- logAndPDFCombinedEDA("klnk_2016")
## 
## RDS Name: metar_klnk_2016.rds
## Log Name: metar_klnk_2016_EDA.log
## PDF Name: metar_klnk_2016_EDA.pdf
## Will use Lincoln, NE (2016) as the description for metar_klnk_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_klnk_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_klnk_2016_EDA.log
kmke_2016 <- logAndPDFCombinedEDA("kmke_2016")
## 
## RDS Name: metar_kmke_2016.rds
## Log Name: metar_kmke_2016_EDA.log
## PDF Name: metar_kmke_2016_EDA.pdf
## Will use Milwaukee, WI (2016) as the description for metar_kmke_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmke_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmke_2016_EDA.log
kmsn_2016 <- logAndPDFCombinedEDA("kmsn_2016")
## 
## RDS Name: metar_kmsn_2016.rds
## Log Name: metar_kmsn_2016_EDA.log
## PDF Name: metar_kmsn_2016_EDA.pdf
## Will use Madison, WI (2016) as the description for metar_kmsn_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsn_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsn_2016_EDA.log
kmsp_2016 <- logAndPDFCombinedEDA("kmsp_2016")
## 
## RDS Name: metar_kmsp_2016.rds
## Log Name: metar_kmsp_2016_EDA.log
## PDF Name: metar_kmsp_2016_EDA.pdf
## Will use Minneapolis, MN (2016) as the description for metar_kmsp_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsp_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsp_2016_EDA.log
ktvc_2016 <- logAndPDFCombinedEDA("ktvc_2016")
## 
## RDS Name: metar_ktvc_2016.rds
## Log Name: metar_ktvc_2016_EDA.log
## PDF Name: metar_ktvc_2016_EDA.pdf
## Will use Traverse City, MI (2016) as the description for metar_ktvc_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ktvc_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ktvc_2016_EDA.log
# Run for 2015-2016-2017 for klas, kmsy, kord, ksan
klas_2015 <- logAndPDFCombinedEDA("klas_2015")
## 
## RDS Name: metar_klas_2015.rds
## Log Name: metar_klas_2015_EDA.log
## PDF Name: metar_klas_2015_EDA.pdf
## Will use Las Vegas, NV (2015) as the description for metar_klas_2015.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2015_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2015_EDA.log
klas_2016 <- logAndPDFCombinedEDA("klas_2016")
## 
## RDS Name: metar_klas_2016.rds
## Log Name: metar_klas_2016_EDA.log
## PDF Name: metar_klas_2016_EDA.pdf
## Will use Las Vegas, NV (2016) as the description for metar_klas_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2016_EDA.log
klas_2017 <- logAndPDFCombinedEDA("klas_2017")
## 
## RDS Name: metar_klas_2017.rds
## Log Name: metar_klas_2017_EDA.log
## PDF Name: metar_klas_2017_EDA.pdf
## Will use Las Vegas, NV (2017) as the description for metar_klas_2017.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2017_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2017_EDA.log
kmsy_2015 <- logAndPDFCombinedEDA("kmsy_2015")
## 
## RDS Name: metar_kmsy_2015.rds
## Log Name: metar_kmsy_2015_EDA.log
## PDF Name: metar_kmsy_2015_EDA.pdf
## Will use New Orleans, LA (2015) as the description for metar_kmsy_2015.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2015_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2015_EDA.log
kmsy_2016 <- logAndPDFCombinedEDA("kmsy_2016")
## 
## RDS Name: metar_kmsy_2016.rds
## Log Name: metar_kmsy_2016_EDA.log
## PDF Name: metar_kmsy_2016_EDA.pdf
## Will use New Orleans, LA (2016) as the description for metar_kmsy_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2016_EDA.log
kmsy_2017 <- logAndPDFCombinedEDA("kmsy_2017")
## 
## RDS Name: metar_kmsy_2017.rds
## Log Name: metar_kmsy_2017_EDA.log
## PDF Name: metar_kmsy_2017_EDA.pdf
## Will use New Orleans, LA (2017) as the description for metar_kmsy_2017.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2017_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2017_EDA.log
kord_2015 <- logAndPDFCombinedEDA("kord_2015")
## 
## RDS Name: metar_kord_2015.rds
## Log Name: metar_kord_2015_EDA.log
## PDF Name: metar_kord_2015_EDA.pdf
## Will use Chicago, IL (2015) as the description for metar_kord_2015.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2015_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2015_EDA.log
kord_2016 <- logAndPDFCombinedEDA("kord_2016")
## 
## RDS Name: metar_kord_2016.rds
## Log Name: metar_kord_2016_EDA.log
## PDF Name: metar_kord_2016_EDA.pdf
## Will use Chicago, IL (2016) as the description for metar_kord_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2016_EDA.log
kord_2017 <- logAndPDFCombinedEDA("kord_2017")
## 
## RDS Name: metar_kord_2017.rds
## Log Name: metar_kord_2017_EDA.log
## PDF Name: metar_kord_2017_EDA.pdf
## Will use Chicago, IL (2017) as the description for metar_kord_2017.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2017_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2017_EDA.log
ksan_2015 <- logAndPDFCombinedEDA("ksan_2015")
## 
## RDS Name: metar_ksan_2015.rds
## Log Name: metar_ksan_2015_EDA.log
## PDF Name: metar_ksan_2015_EDA.pdf
## Will use San Diego, CA (2015) as the description for metar_ksan_2015.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2015_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2015_EDA.log
ksan_2016 <- logAndPDFCombinedEDA("ksan_2016")
## 
## RDS Name: metar_ksan_2016.rds
## Log Name: metar_ksan_2016_EDA.log
## PDF Name: metar_ksan_2016_EDA.pdf
## Will use San Diego, CA (2016) as the description for metar_ksan_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2016_EDA.log
ksan_2017 <- logAndPDFCombinedEDA("ksan_2017")
## 
## RDS Name: metar_ksan_2017.rds
## Log Name: metar_ksan_2017_EDA.log
## PDF Name: metar_ksan_2017_EDA.pdf
## Will use San Diego, CA (2017) as the description for metar_ksan_2017.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2017_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2017_EDA.log

Reload Traverse City, MI (2016) due to a previous error (short-term fix to avoid re-running a full cache):

ktvc_2016 <- logAndPDFCombinedEDA("ktvc_2016")
## 
## RDS Name: metar_ktvc_2016.rds
## Log Name: metar_ktvc_2016_EDA.log
## PDF Name: metar_ktvc_2016_EDA.pdf
## Will use Traverse City, MI (2016) as the description for metar_ktvc_2016.rds
## 
## 
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ktvc_2016_EDA.pdf 
## 
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ktvc_2016_EDA.log

The files are now available for further analysis, with individual PDF files for plots associated with each locale.

Comparing Multiple Locales

With EDA about each locale saved to .pdf and .log files, it is interesting to investigate comparisons among the various locales. For example, how do the temperature patterns by month for a given locale compare to the overall global median temperature patterns by month?

The existing functions contain most of the code needed to perform this. The main steps to add are:

  1. Combine processed data files
  2. Adapt the base functions to create the overall mean or median, when requested by parameter
  3. Adapt the base functions to add the overall mean or median to the plot, when requested by parameter
  4. Facet the plot by locale, when requested by parameter

The first step is to combine one or more processed files, with a column added for locale:

# Combine files from a character list
combineProcessedFiles <- function(charList, mapper=cityNameMapper) {
    
    # Combine the objects represented by charList, and name the list items using charList
    listFiles <- lapply(charList, FUN=function(x) { get(x) })
    names(listFiles) <- charList
    
    # Bind rows, and add the descriptive locale name as sourceName
    tblFiles <- bind_rows(listFiles, .id="source") %>%
        mutate(sourceName=mapper[source])
    
    tblFiles
    
}

The 2016 data will be used to run the combined process:

# Grab all the data that ends in _2016
locales2016 <- ls() %>%
    grep(pattern="_2016", value=TRUE)
cat("\nLocales used will be:\n\n", paste0(locales2016, collapse="\n"), "\n\n", sep="")
## 
## Locales used will be:
## 
## kdtw_2016
## kewr_2016
## kgrb_2016
## kgrr_2016
## kiah_2016
## kind_2016
## klas_2016
## klnk_2016
## kmke_2016
## kmsn_2016
## kmsp_2016
## kmsy_2016
## kord_2016
## ksan_2016
## ktvc_2016
# Combine the 2016 data
all2016Data <- combineProcessedFiles(locales2016)

# Show counts by sourceName
all2016Data %>%
    count(source, sourceName)
## # A tibble: 15 x 3
##    source    sourceName                   n
##    <chr>     <chr>                    <int>
##  1 kdtw_2016 Detroit, MI (2016)        8818
##  2 kewr_2016 Newark, NJ (2016)         8821
##  3 kgrb_2016 Green Bay, WI (2016)      8803
##  4 kgrr_2016 Grand Rapids, MI (2016)   8812
##  5 kiah_2016 Houston, TX (2016)        8816
##  6 kind_2016 Indianapolis, IN (2016)   8767
##  7 klas_2016 Las Vegas, NV (2016)      8818
##  8 klnk_2016 Lincoln, NE (2016)        8813
##  9 kmke_2016 Milwaukee, WI (2016)      8808
## 10 kmsn_2016 Madison, WI (2016)        8798
## 11 kmsp_2016 Minneapolis, MN (2016)    8817
## 12 kmsy_2016 New Orleans, LA (2016)    8813
## 13 kord_2016 Chicago, IL (2016)        8815
## 14 ksan_2016 San Diego, CA (2016)      8810
## 15 ktvc_2016 Traverse City, MI (2016)  8814

The following global summaries will be useful:

  • plotCountsByMetric() - overall percentage by metric, applied to total counts for locale
  • plotNumCor() - overall geom_smooth()
  • plotFactorNumeric() - overall mean/median of numeric by factor
  • corMETAR() and lmMETAR() - not applicable, though could be run on full dataset
  • basicWindPlots() - tbd, perhaps adapt along with consolidatePlotWind
  • consolidatePlotWind() - tbd, perhaps adapt along with basicWindPlots

Helper functions can be created for:

  • helperCountsByMetric() - get the overall percentage by metric for variable x
  • helperNumCor() - get an overall geom_smooth for variable y vs. variable x
  • helperFactorNumeric() - get the overall mean or median for numeric variable y by factor variable x

The function helperFactorNumeric is created to apply function f to numeric variable y by factor variable x:

# Helper function to get overall percentage by metric for variable x
helperCountsByMetric <- function(tbl, ctVar, sumOn="dummyVar") {

    tbl %>%
        mutate(dummyVar=1) %>%
        select_at(vars(all_of(c(ctVar, sumOn)))) %>%
        filter_all(all_vars(!is.na(.))) %>%
        group_by_at(ctVar) %>%
        summarize(n=sum(get(sumOn))) %>%
        mutate(nPct=n/sum(n))
        
}

# Example run to get counts by greatest sky obscuration
helperCountsByMetric(all2016Data, ctVar="wType")
## # A tibble: 7 x 3
##   wType     n    nPct
##   <fct> <dbl>   <dbl>
## 1 VV      761 0.00576
## 2 OVC   39480 0.299  
## 3 BKN   31352 0.237  
## 4 SCT   16212 0.123  
## 5 FEW   19304 0.146  
## 6 CLR   24433 0.185  
## 7 Error   601 0.00455
# Helper function to get a geom_smooth for variable y vs variable x
helperNumCor <- function(tbl, 
                         xVar, 
                         yVar, 
                         sumOn="dummyVar",
                         se=TRUE, 
                         color="red", 
                         method="lm", 
                         lty=2
                         ) {
    
    # Generate the overall totals for sumOn by xVar and yVar
    plotData <- tbl %>%
        mutate(dummyVar=1) %>%
        select_at(vars(all_of(c(xVar, yVar, sumOn)))) %>%
        filter_all(all_vars(!is.na(.))) %>%
        group_by_at(vars(all_of(c(xVar, yVar)))) %>%
        summarize(nTotal=sum(get(sumOn)))
    
    geom_smooth(data=plotData, 
                aes_string(x=xVar, y=yVar, weight="nTotal"), 
                se=se, 
                color=color, 
                method=method, 
                lty=lty
                )
    
}

# Example run to get TempC vs DewC
helperNumCor(all2016Data, xVar="TempC", yVar="DewC")
## mapping: weight = ~nTotal, x = ~TempC, y = ~DewC 
## geom_smooth: na.rm = FALSE, se = TRUE
## stat_smooth: na.rm = FALSE, se = TRUE, method = lm, formula = y ~ x
## position_identity
# Example for using the helper function on a plot
plotNumCor(kdtw_2016, var1="TempC", var2="DewC") + 
    helperNumCor(all2016Data, xVar="TempC", yVar="DewC")

# Helper function to calculate .f(numVar) by byVar
helperFactorNumeric <- function(tbl, .f, byVar, numVar, ...) {
    
    tbl %>%
        select_at(vars(all_of(c(byVar, numVar)))) %>%
        filter_all(all_vars(!is.na(.))) %>%
        group_by_at(byVar) %>%
        summarize(helpFN=.f(get(numVar), ...))
    
}

# Example for getting median TempF by month
helperFactorNumeric(all2016Data, .f=median, byVar="month", numVar="TempF")
## # A tibble: 12 x 2
##    month helpFN
##    <fct>  <dbl>
##  1 Jan     30.9
##  2 Feb     35.1
##  3 Mar     46.9
##  4 Apr     52.0
##  5 May     64.0
##  6 Jun     73.0
##  7 Jul     77  
##  8 Aug     75.9
##  9 Sep     71.1
## 10 Oct     61.0
## 11 Nov     50  
## 12 Dec     34.0

The function plotCountsByMetric() has been updated above to allow for facetting and plotting of the overall central tendency. Two examples are shown - the base from previous, and a facetted example:

# Previous Example for Detroit 2016 - using WindDir, cType1, month, wType
plotcountsByMetric(kdtw_2016, 
                   mets=c("WindDir", "cType1", "month", "wType"), 
                   title="Detroit, MI (2016)", 
                   dropNA=TRUE, 
                   diagnose=TRUE
                   )
## 
## Dropping 1 rows with 49 observations due to NA

# Facetted example for kdtw_2016, kord_2016, klas_2016, ksan_2016
useData <- all2016Data %>%
    filter(source %in% c("kdtw_2016", "klas_2016", "kord_2016", "ksan_2016"))

plotcountsByMetric(useData, 
                   mets=c("WindDir", "cType1", "month", "wType"), 
                   title="Comparison Across Locales (red dots are the median)", 
                   dropNA=TRUE, 
                   diagnose=TRUE, 
                   facetOn="sourceName", 
                   showCentral=TRUE
                   )
## 
## Dropping 4 rows with 117 observations due to NA

As observed previously, Las Vegas tends towards southerly winds while San Diego tends towards northwesterly winds and calm (direction 000) winds. Detroit is most likely to be overcast, while Las Vegas is most likely to be clear.

The function plotNumCor() has been updated above to allow for facetting and plotting of the overall central tendency. Two examples are shown - the base from previous, and a facetted example:

# Example for Detroit 2016 - using TempC and DewC
plotNumCor(kdtw_2016, var1="TempC", var2="DewC", subT="Detroit, MI (2016)", diagnose=TRUE)
## 
## Dropping 1 rows with 49 observations due to NA

# Facetted example for kdtw_2016, kord_2016, klas_2016, ksan_2016
useData <- all2016Data %>%
    filter(source %in% c("kdtw_2016", "klas_2016", "kord_2016", "ksan_2016"))

# Facetted plot for very highly correlated variables TempC and TempF
plotNumCor(useData, 
           var1="TempC", 
           var2="TempF", 
           subT="Comparison Across Locales (red dashed lines are the overall)", 
           dropNA=TRUE, 
           diagnose=TRUE, 
           facetOn="sourceName", 
           showCentral=TRUE
           )
## 
## Dropping 4 rows with 117 observations due to NA

# Facetted plot for highly correlated variables TempF and DewF
plotNumCor(useData, 
           var1="TempF", 
           var2="DewF", 
           subT="Comparison Across Locales (red dashed lines are the overall)", 
           dropNA=TRUE, 
           diagnose=TRUE, 
           facetOn="sourceName", 
           showCentral=TRUE
           )
## 
## Dropping 4 rows with 117 observations due to NA

The differences in the relationships between temperature and dew point stand out:

  • Chicago and Detroit have a wide spread of both temperature and dew point, and they tend to rise and fall together
  • San Diego has a narrow spread of both temperature and dew point, and they have a lesser tendency to rise and fall together
  • Las Vegas has very little change in average dew point wven as temperatures range from 30-100 degrees F

In contrast, of course, temperatures measured in C and F all follow the same pattern regardless of city

The function plotFactorNumeric() has been updated above to allow for facetting and plotting of the overall central tendency. Two examples are shown - the base from previous, and a facetted example:

# Example for Detroit 2016 - using TempF and month
plotFactorNumeric(kdtw_2016, 
                  fctVar="month", 
                  numVar="TempF", 
                  subT="Detroit, MI (2016)", 
                  showXLabel=FALSE,
                  diagnose=TRUE
                  )
## 
## Removing 49 records due to NA

# Facetted example for kdtw_2016, kord_2016, klas_2016, ksan_2016
useData <- all2016Data %>%
    filter(source %in% c("kdtw_2016", "klas_2016", "kord_2016", "ksan_2016"))

plotFactorNumeric(useData, 
                  fctVar="month", 
                  numVar="TempF", 
                  subT="Overall median shown as red dot", 
                  showXLabel=FALSE,
                  diagnose=TRUE,
                  facetOn="sourceName", 
                  showCentral=TRUE
                  )
## 
## Removing 117 records due to NA

Las Vegas consistently runs above the overall median temperature, while Chicago and Detroit run below the overall median temperature, particularly during the cold season. San Diego has little temperature variation by month and is thus below the median in the warm season and above the median in the cold season.

It is now possible to re-run the EDA plotting routines, focused on the 2016 data:

# Create rounded TempF and DewF in all2016Data
all2016Data <- all2016Data %>%
    mutate(TempF5=5*round(round(TempF)/5), 
           DewF5=5*round(round(DewF)/5), 
           WindSpeed5=5*round(WindSpeed/5),
           Altimeter10=round(Altimeter, 1)
           )

# Counts by Metric for all 2016 data
plotcountsByMetric(all2016Data, 
                   mets=c("month", "year",
                          "WindDir", "WindSpeed5", 
                          "Visibility", "Altimeter10",
                          "TempF5", "DewF5", 
                          "wType"
                          ), 
                   title="Comparisons Across Locales (red dots are the median)", 
                   facetOn="sourceName",
                   showCentral=TRUE
                   )

The cross-locale comparisons bring out a few salient features:

DATA VOLUMES:
* All locales have roughly the same amount of data by year and month, focused on 2016 and with 1-2 days on either side

WIND DIRECTION and WIND SPEED:
* Las Vegas has an excess of no/variable wind and of southerly winds, both appropriate for a desert
* Houston has an excess of no wind and southerly winds, both appropriate for the Gulf Coast
* San Diego has an excess of no wind and of northwesterly wind, both appropriate for the Pacific coast
* Chicago, Grand Rapids, Indianapolis, Minneapolis, and Newark all have lower occurences of no wind, appropriate for relatively cold mid-latitude cities
* Lincoln looks “about normal”, with the exception that it has slightly more southerly winds; Lincoln is the only Great Plains locale in the analysis, and it appears to be a blend of Gulf Coast and Wintry
* Detroit, Green Bay, Milwaukee, and New Orleans all look “about average”; this is not surprising in the first three cases given the predominance of cold, mid-latitude locales, but is surprising for New Orleans
* Madison and Traverse City are surprising in that both show a predominance of no/variable winds; this is uncommon for cold-weather cities in the mid-latitude and merits further examination

VISIBILITY:
* The overwhelming majority of visibilities are 10SM (the highest that is recorded; more or less means unlimited in the METAR)
* There is a data issue with a Visibility > 10 that should be addressed
* Las Vegas is slightly more likely than most to have unlimited Visibility and Detroit is slightly less likely than most to have unlimited visibility

ALTIMETER:
* Las Vegas skews low as appropriate for a high-altitude desert locale
* New Orleans and Houston show less variance, perhaps driven by being roughly at sea level and in close proximity to the Gulf of Mexico
* San Diego shows very low variance, perhaps driven by being roughly at sea level and in close proximity to the Pacific Ocean

TEMPERATURE:
* Houston, Las Vegas, and New Orleans skew warm as expected
* San Diego has ver low variance as expected
* At a gross level, the other cities look similar to the median, likely driven by the predominance of cold, mid-latitude locales in the data file

DEW POINT:
* Houston and New Orleans skew very high as expected
* Las Vegas skews very low as expected
* San Diego has very low variance as expected

SKY OBSCURATION:
* Lincoln and Green Bay are the most likely to be CLR (clear, no clouds on the automated sensor). This may be driven by a difference in maximum sensor heights, and is unexpected in Green Bay which should be frequently cloudy due to its latitude and proximity to a large body of water
* Detroit, Traverse City, Grand Rapids, and Minneapolis are especially likely to be overcast
* Las Vegas is especially likely to have few clouds or to be clear

Comparisons are run for a few of the numerical correlations:

# Example for 2016 - using mixes of WindSpeed, Altimeter, TempF, DewF, TempC, DewC
numCorList <- list(c("TempC", "TempF"), 
                   c("DewC", "DewF"), 
                   c("TempF", "DewF"), 
                   c("Altimeter", "WindSpeed"), 
                   c("Altimeter", "TempF")
                   )

# Run the list through plotNumCor()
for (x in numCorList) {
    plotNumCor(all2016Data, 
               var1=x[1], 
               var2=x[2], 
               subT="Red dashed line is the overall slope", 
               diagnose=TRUE, 
               facetOn="sourceName", 
               showCentral=TRUE
               )
}
## 
## Dropping 15 rows with 593 observations due to NA

## 
## Dropping 15 rows with 593 observations due to NA

## 
## Dropping 15 rows with 593 observations due to NA

## 
## Dropping 15 rows with 593 observations due to NA

## 
## Dropping 15 rows with 593 observations due to NA

The cross-locale comparisons bring out a few salient features:

FAHRENHEIT AND CELSIUS:
* As expected, TempF/C are perfectly correlated and DewF/C are perfectly correlated. Since the observations were taken in the US, the TempF/DewF data will be used (TempC/DewC are conversions from the measured TempF/DewF to match the international standard for METAR reporting)

TEMPERATURE AND DEW POINT:
* While many cities have different clusters of temperature/dew point, all but Las Vegas and San Diego follow a pattern where the temperature and the dew point tend to run together at a similar rate
* In Las Vegas, the dew point is largely independent of the temperature
* In San Diego, there is less variance in temperature and dew point, and a lower (but still obvious) tendency for tenmperature and dew point to rise/fall together

ALTIMETER AND WIND SPEED:
* As expected, when the altimeter rises, on average, the wind speed falls
* This tendency is less pronounced in Houston and New Orleans; and more pronounced in Las Vegas, Dan Diego, and Indianapolis

ALTIMETER AND TEMPERATURE:
* Overall, low temperatures and high altimeters tend to be observed together
* This is especially so in Houston, Las Vegas, and New Orleans
* This is more modest in Grand Rapids, Green Bay, and Traverse City

Comparisons are then run for numeric variables against factor variables:

# Modify windDir so that it is just N, NE, E, SE, S, SW, W, NW, 000, Variable
mod2016Data <- all2016Data %>%
    mutate(tempDir=ifelse(is.na(WindDir) | WindDir %in% c("000", "VRB"), -1, as.numeric(WindDir)),
           predomDir=factor(case_when(is.na(WindDir) ~ "Error", 
                                      WindDir=="000" ~ "000", 
                                      WindDir=="VRB" ~ "VRB", 
                                      tempDir >= 337.5 ~ "N", 
                                      tempDir <= 22.5 ~ "N",
                                      tempDir <= 67.5 ~ "NE", 
                                      tempDir <= 112.5 ~ "E", 
                                      tempDir <= 157.5 ~ "SE", 
                                      tempDir <= 202.5 ~ "S", 
                                      tempDir <= 247.5 ~ "SW", 
                                      tempDir <= 292.5 ~ "W", 
                                      tempDir <= 337.5 ~ "NW",
                                      TRUE ~ "Error"
                                      ), 
                            levels=c("Error", "000", "VRB", "NE", "E", "SE", "S", "SW", "W", "NW", "N")
                            )
           )
## Warning in ifelse(is.na(WindDir) | WindDir %in% c("000", "VRB"), -1,
## as.numeric(WindDir)): NAs introduced by coercion
# Key factor variables include month, wType, predomDir
# Key numeric variables include WindSpeed, Altimeter, TempF, DewF, Visibility
fctNumList <- list(c("month", "WindSpeed"), 
                   c("month", "Altimeter"), 
                   c("month", "TempF"), 
                   c("month", "DewF"), 
                   c("month", "Visibility"),
                   c("wType", "WindSpeed"),
                   c("wType", "Altimeter"),
                   c("wType", "Visibility"),
                   c("predomDir", "WindSpeed"),
                   c("predomDir", "Altimeter"),
                   c("predomDir", "TempF"),
                   c("predomDir", "DewF")
                   )

for (x in fctNumList) {
    plotFactorNumeric(mod2016Data, 
                      fctVar=x[1], 
                      numVar=x[2], 
                      subT="Red dots are the overall average", 
                      showXLabel=FALSE,
                      diagnose=TRUE, 
                      facetOn="sourceName",
                      showCentral=TRUE
                      )
}
## 
## Removing 593 records due to NA

## 
## Removing 593 records due to NA

## 
## Removing 593 records due to NA

## 
## Removing 593 records due to NA

## 
## Removing 593 records due to NA

## 
## Removing 593 records due to NA

## 
## Removing 593 records due to NA

## 
## Removing 593 records due to NA

## 
## Removing 593 records due to NA

## 
## Removing 593 records due to NA

## 
## Removing 593 records due to NA

## 
## Removing 593 records due to NA

The cross-locale comparisons bring out a few salient points:

WIND SPEED BY MONTH:
* The plot is too busy; need to rethink, potentially by removing the outliers and plotting only the box

ALTIMETER BY MONTH:
* Same as above; this plot is too busy

TEMPERATURE BY MONTH:
* Seasonal patterns are observed in most of the data, with a warm season centered around July and a cold season centered around December
* Las Vegas stand out for running warmer than the overall average in every month
* Houston, New Orleans, and San Diego stand out for running warmer than the overall average in the cold season and similar to the overall average in the warm season

DEW POINT BY MONTH:
* Seasonal patterns are observed in most of the data, with the humid seasons tracking with the warm seasons
* Houston and New Orleans run consistently above the average dew point by month
* San Diego runs above the average dew point during the cold season
* Las Vegas runs below the average dew point during the warm season

VISIBILITY BY MONTH:
* Visibilities are overwhelmingly likely to be 10 SM
* The Newark, NJ outlier described earlier (19 SM) should be deleted
* Detroit is especially likely to have Visibility less than 10 SM
* Grand Rapids and Traverse City have meaningul occurences of Visibility less than 10 SM during the cold season

WIND SPEED BY SKY OBSCURATION:
* This chart is too busy as per above

ALTIMETER BY SKY OBSCURATION:
* Not much at a glance

VISIBILITY BY SKY OBSCURATION:
* The VV and OVC sky obscurations are most associated with low visibilities; VV in particular is almost always associated with very low visibility

WIND SPEED BY WIND DIRECTION:
* Wind direction “000” is always associated with wind speed 0, as expected
* Wind direction “VRB” is always associated with a low but non-zero wind speed, as expected
* While some cities are windier than others, there is no pronounced tendency for wind speed to be highly associated with a given wind direction in any locale

ALTIMETER BY WIND DIRECTION:
* No gross trends observed
* The plot is rather busy, especially given the low variance in median/IQR for altimeter relative to the outlier points

TEMPERATURE/DEW POINT BY WIND DIRECTION:
* Plot is not great for reading and interpreting

Next steps are to modify a few of the plots for better interpretability (less busy, more variation of the core data metric relative to the full y-axis, etc.), investigate and rectify the data issues observed, and save a version of the file for further analysis.

A modified boxplot function is created to plot only the median and the IQR, with the goal of having more of the variance in the median visible on the plot:

plotMedianIQR <- function(met, 
                          fctVar, 
                          numVar, 
                          title=NULL, 
                          subT="", 
                          mid=0.5,
                          rng=c(0.25, 0.75),
                          diagnose=TRUE,
                          showXLabel=TRUE,
                          mapper=varMapper,
                          facetOn=NULL, 
                          showCentral=FALSE, 
                          ylimits=NULL
                          ) {
    
    # Function arguments
    # met: dataframe or tibble containing raw data
    # fctVar: character vector of variable to be used for the x-axis (factor in the boxplot)
    # numVar: character vector of variable to be used for the y-axis (numeric in the boxplot)
    # title: character vector for plot title
    # subT: character vector for plot subtitle
    # mid: float between 0 and 1 for the quantile to be used as the midpoint
    # rng: length-two float vector for (lo, hi) to be used as the dimensions of the box
    # diagnose: boolean for whether to note in the log the number of NA observations dropped
    # showXLabel: boolean for whether to include the x-label (e.g., set to FALSE if using 'month')
    # mapper: named list containing mapping from variable name to well-formatted name for titles and axes
    # facetOn: a facetting variable for the supplied met (NULL for no faceting)
    # showCentral: boolean for whether to show the central tendency over-plotted on the main data
    # ylimits: length-two numeric for the y-axis minimum and maximum (default NULL uses plot defaults)
    
    # Function usage
    # 1.  By default, the function creates a modified boxplot of numVar by fctVar - line at mid, box going from rng[1] to rng[2]
    # 2.  If facetOn is passed as a non-NULL, then the data in #1 will be facetted by facetOn
    # 3.  If showCentral=TRUE, then the overall median of numVar by fctVar will be plotted as a red dot

    
    # Check that the quantile variables are sensible
    if (length(mid) != 1 | length(rng) != 2) {
        stop("Must pass a single value as mid and a length-two vector as rng\n")    
    }
    if (min(c(mid, rng)) < 0 | max(c(mid, rng)) > 1) {
        stop("All values of mid and rng must be between 0 and 1, inclusive\n")
    }
    if ((mid < rng[1]) | (mid > rng[2])) {
        stop("mid must be at least as big as rng[1] and no greater than rng[2]\n")
    }
    quants <- paste0(round(100*c(rng[1], mid, rng[2]), 0), "%", collapse=" ")
    
    # Create the title if not passed
    if (is.null(title)) { 
        title <- paste0("Hourly Observations of ", mapper[numVar], " by ", mapper[fctVar])
    }
    
    # Remove the NA variables
    nOrig <- nrow(met)
    dat <- met %>%
        filter(!is.na(get(fctVar)), !is.na(get(numVar)))
    if (diagnose) { cat("\nRemoving", nOrig-nrow(dat), "records due to NA\n") }
    
    # Create the quantile data by fctVar and (if passed) facetOn
    groupVars <- fctVar 
    if (!is.null(facetOn)) { groupVars <- c(groupVars, facetOn) }
    plotData <- dat %>%
        group_by_at(groupVars) %>%
        summarize(midPoint=quantile(get(numVar), probs=mid), 
                  loPoint=quantile(get(numVar), probs=rng[1]),
                  hiPoint=quantile(get(numVar), probs=rng[2])
                  )
    
    # Create the base plot
    p <- plotData %>%
        ggplot(aes_string(x=fctVar, y="midPoint")) +
        geom_crossbar(aes(ymin=loPoint, ymax=hiPoint), fill="lightblue") +
        labs(title=title,
             subtitle=subT,
             x=ifelse(showXLabel, paste0(mapper[fctVar], " - ", fctVar), ""),
             y=paste0(mapper[numVar], " - ", numVar), 
             caption=paste0("Quantiles plotted: ", quants)
             )
    
    # If facetting has been requested, facet by the desired variable
    if (!is.null(facetOn)) {
        p <- p + facet_wrap(as.formula(paste("~", facetOn)))
    }
    
    # If showCentral=TRUE, add a dot plot for the overall value of 'mid'
    if (showCentral) {
        centData <- helperFactorNumeric(dat, .f=quantile, byVar=fctVar, numVar=numVar, probs=mid)
        p <- p + geom_point(data=centData, aes(y=helpFN), size=2, color="red")
    }
    
    # If ylim has been passed, use it
    if (!is.null(ylimits)) {
        p <- p + ylim(ylimits)
    }

    # Render the final plot
    print(p)
    
}

The function can then be applied in an attempt to get a better look at a few of the comparisons:

# Key factor variables include month, wType, predomDir
# Key numeric variables include WindSpeed, Altimeter, TempF, DewF, Visibility
fctNumListIQR <- list(c("month", "WindSpeed"), 
                      c("month", "Altimeter"), 
                      c("wType", "WindSpeed"),
                      c("predomDir", "Altimeter"),
                      c("predomDir", "TempF"),
                      c("predomDir", "DewF")
                      )

for (x in fctNumListIQR) {
    plotMedianIQR(mod2016Data, 
                  fctVar=x[1], 
                  numVar=x[2], 
                  subT="Red dots are the overall mid-quantile", 
                  showXLabel=FALSE,
                  diagnose=TRUE, 
                  facetOn="sourceName",
                  showCentral=TRUE
                  )
}
## 
## Removing 593 records due to NA

## 
## Removing 593 records due to NA

## 
## Removing 593 records due to NA

## 
## Removing 593 records due to NA

## 
## Removing 593 records due to NA

## 
## Removing 593 records due to NA

Limiting the observations to Q1-Median-Q2 brings out a bit more information in the plots:

WIND SPEED BY MONTH:
* San Diego and Las Vegas are meaningfully less windy than the median, especially during the cold season

ALTIMETER BY MONTH:
* San Diego has very little variance in altimeter relative to the other locales
* Las Vegas runs especially low on altimeter during the warm season
* There is a seasonal pattern where altimeters tend to run lower during the warm season while also showing a much smaller Q1-Q3 range

WIND SPEED BY SKY OBSCURATION:
* Newark shows very high wind speeds when obscuration is VV
* Should re-run with data y-axis capped at 15 to visualize the other locales

ALTIMETER BY WIND DIRECTION:
* Altimeters tend to be a bit lower when winds are from the S-SW-W-NW, a pattern that seems consistent across all locales
* Las Vegas runs especially low on altimeter relative to other locales when winds are from E-SE-S

TEMPERATURE BY WIND DIRECTION:
* Variable winds are associated with higher temperatures than calm (speed 0) winds
* Winds from W-NW are generally associated with somewhat lower temperatures

DEW POINT BY WIND DIRECTION:
* Winds from SW-W-NW-N are associated with generally lower dew points than winds from NE-E-SE-S
* Las Vegas typically runs low on dew points while Houston and New Orleans typically run high on dew points
* San Diego has little variance in dew points and does not see any dip when winds are from SW-W-NW-N

The wind speed vs. sky obscuration plot is re-run with y-limits that suppress Newark VV:

plotMedianIQR(mod2016Data, 
              fctVar="wType", 
              numVar="WindSpeed", 
              subT="Red dots are the overall mid-quantile", 
              showXLabel=FALSE,
              diagnose=TRUE, 
              facetOn="sourceName",
              showCentral=TRUE, 
              ylimits=c(0, 15)
              )
## 
## Removing 593 records due to NA
## Warning: Removed 1 rows containing missing values (geom_crossbar).

Low vertical visibilities (VV) are generally associated with lower winds. There is also a slight tendency for more obscured skies (OVC) to be associated with slightly stronger winds.

Further, the previous analysis for counts by wind direction is re-run using the predominant directions:

# Counts by Metric for predomDir using mod2016Data
plotcountsByMetric(mod2016Data, 
                   mets=c("predomDir"), 
                   title="Comparisons Across Locales (red dots are the median)", 
                   facetOn="sourceName",
                   showCentral=TRUE
                   )

Findings include:
* Across all locales, winds are more commonly from S-SW-W-NW-N than from NE-E-SE
* San Diego is especially likely to have either 0 wind or wind from W-NW
* Traverse City is especially likely to have 0 wind or variable wind (very surprising, and possibly a sign of anomalous data)
* Minneapolis is especially likely to experience winds from NE
* New Orleans and Houston rarely experience winds from SW-W-NW relative to other locales
* Lincoln and Las Vegas are especially pronse to winds from S (and for Las Vegas SW)

There are several issues identified that should be explored and fixed if appropriate:

  • Vertical visibility greater than 10 recorded
  • Tendency for maximum obscuration to be CLR in Lincoln and Green Bay
  • Tendency for Traverse City to have no or variable winds

Cloud Data Exploration

Cloud exploration can highlight whether all locales are being compared apples to apples. This is especially the case if some locales have a different maximum sensor height:

# Select the source, sourceName, dtime and cLevel variables; pivot cLevel down
cLevels <- mod2016Data %>%
    select(source, sourceName, dtime, starts_with("cLevel")) %>%
    pivot_longer(-c(source, sourceName, dtime), names_to="level", values_to="height") %>%
    mutate(level=as.integer(str_replace(level, pattern="cLevel", replacement="")))

# Select the source, sourceName, dtime and cType variables; pivot cLevel down
cTypes <- mod2016Data %>%
    select(source, sourceName, dtime, starts_with("cType")) %>%
    pivot_longer(-c(source, sourceName, dtime), names_to="level", values_to="type") %>%
    mutate(level=as.integer(str_replace(level, pattern="cType", replacement="")))

cData <- cLevels %>%
    inner_join(cTypes, by=c("source", "sourceName", "dtime", "level"))

# Plot cloud heights, using only non-NA
cData %>%
    filter(!is.na(height)) %>%
    ggplot(aes(x=fct_reorder(sourceName, height, .fun=max, na.rm=TRUE), y=height)) + 
    geom_violin(fill="lightblue") + 
    coord_flip() + 
    labs(x="", y="Cloud Height (feet)", title="Density of cloud heights by locale")

There are clearly differences in maximum cloud height recorded by locale:

  • 4 locales record clouds up to 35,000 feet
  • 5 locales record clouds up to 30,000 feet; there appear to be few if any clouds recorded above 30,000 feet, so cloud distributions may not be meaningfully impacted by this
  • 3 locales record clouds up to 25,000 feet; there are meaningful clouds between 25,000 and 30,000 feet, so these locales may appear to be “more clear than normal” purely due to maximum recorded height
  • 2 locales record heights up to 12,000 feet (Lincoln and Green Bay); these are the cities that appeared anomalously clear in the EDA, and the reason is thet they exclude mid/high level cloudiness

The distribution of cloud types observed can also be assessed:

# Plot cloud heights, using only non-""
fctLayers <- c("VV", "OVC", "BKN", "SCT", "FEW")

cData %>%
    filter(type!="") %>%
    mutate(type=factor(type, levels=fctLayers)) %>%
    ggplot(aes(x=fct_reorder(sourceName, height, .fun=max, na.rm=TRUE), fill=type)) + 
    geom_bar(position="stack") + 
    coord_flip() + 
    labs(x="", y="Cloud Layer Obscuration", title="Cloud obscuration by locale") + 
    scale_fill_discrete("", rev(fctLayers)) + 
    theme(legend.position="bottom")

Lincoln and Green Bay stand out for having fewer clouds, likely due to the inability to catch the higher altocumulus and any cirrus cloud (both very common in the mid-latitudes).

Supposing that only clouds of 12,000 feet and under are considered:

# Plot cloud heights, using only non-NA
cData %>%
    filter(!is.na(height)) %>%
    filter(height <= 12000) %>%
    ggplot(aes(x=sourceName, y=height)) + 
    geom_violin(fill="lightblue") + 
    coord_flip() + 
    labs(x="", y="Cloud Height (feet)", title="Density of cloud heights by locale")

cData %>%
    filter(type!="") %>%
    filter(height <= 12000) %>%
    mutate(type=factor(type, levels=fctLayers)) %>%
    ggplot(aes(x=fct_reorder(sourceName, sourceName, .fun=length), fill=type)) + 
    geom_bar(position="stack") + 
    coord_flip() + 
    labs(x="", y="Cloud Layer Obscuration", title="Cloud obscuration by locale (up to 12,000 feet)") + 
    scale_fill_discrete("", rev(fctLayers)) + 
    theme(legend.position="bottom")

The patterns are much more plausible:

  • San Diego, known for marine layer, has clouds that skew very low and more obscured
  • Las Vegas, a desert, has clouds that skew higher and less obscured
  • The remaining cities have maximum cloud densities in the 2500-5000 foot range, very common cloud heights in the US
  • Lincoln and Green Bay remain among the least cloudy locales, though this may be due to hitting OVC (cloud layers above an OVC layer are not observed by the sensor/human and thus not recorded)

Further investigation of the cloud data may be interesting.

A function is written to take only cloud data up through height x, and to add a layer of clouds that are “clear” at a height that is out-of-interval:

# Filter to only clouds up to and including height
cloudsLevel0 <- function(df, 
                         maxHeight, 
                         byVars,
                         baseLevel=0,
                         heightBase=-100,
                         typeBase="CLR"
                         ) {
    
    # Function assumptions
    # Input data are unique by byVars-level and with columns 'height' and 'type'
    # Clouds increase in height with level
    # Clouds are non-decreasing in type (VV > OVC > BKN > SCT > FEW) with level
    
    # FUNCTION ARGUMENTS:
    # df: tibble or dataframe contiaining the clouds data
    # maxHeight: the maximum height to consider (delete all heights above this level)
    # byVars: the variables that make up a unique observation (df should be unique by byVars-level)
    # baseLevel: the level to be created as the base level (by default, level 0)
    # heightBase: the height to be provided to the base level (by default, -100 feet)
    # typeBase: the type of obscuration observed at the base level (by default, CLR)
    
    # Add a cloud level 0 that has height -100 (by default)
    # Include only levels where the cloud height is not NA
    # Include only levels where the cloud height is less than or equal to maxHeight
    modData <- df %>% 
        group_by_at(vars(all_of(byVars))) %>% 
        summarize(level=baseLevel, height=heightBase, type=typeBase) %>% 
        ungroup() %>% 
        bind_rows(df) %>% 
        arrange_at(vars(all_of(c(byVars, "level")))) %>% 
        filter(!is.na(height)) %>% 
        filter(height <= maxHeight)
    
    modData
    
}

The function is then run using the existing cData:

modCData <- cloudsLevel0(cData, maxHeight=12000, byVars=c("source", "sourceName", "dtime"))
modCData <- modCData %>%
    mutate(type=factor(type, levels=c("VV", "OVC", "BKN", "SCT", "FEW", "CLR")))
modCData
## # A tibble: 260,324 x 6
##    source    sourceName         dtime               level height type 
##    <chr>     <chr>              <dttm>              <dbl>  <dbl> <fct>
##  1 kdtw_2016 Detroit, MI (2016) 2015-12-31 00:53:00     0   -100 CLR  
##  2 kdtw_2016 Detroit, MI (2016) 2015-12-31 00:53:00     1   2500 BKN  
##  3 kdtw_2016 Detroit, MI (2016) 2015-12-31 00:53:00     2   5000 OVC  
##  4 kdtw_2016 Detroit, MI (2016) 2015-12-31 01:53:00     0   -100 CLR  
##  5 kdtw_2016 Detroit, MI (2016) 2015-12-31 01:53:00     1   2100 OVC  
##  6 kdtw_2016 Detroit, MI (2016) 2015-12-31 02:53:00     0   -100 CLR  
##  7 kdtw_2016 Detroit, MI (2016) 2015-12-31 02:53:00     1   2100 OVC  
##  8 kdtw_2016 Detroit, MI (2016) 2015-12-31 03:53:00     0   -100 CLR  
##  9 kdtw_2016 Detroit, MI (2016) 2015-12-31 03:53:00     1   2500 OVC  
## 10 kdtw_2016 Detroit, MI (2016) 2015-12-31 04:53:00     0   -100 CLR  
## # ... with 260,314 more rows

An additional function is written to take processed cloud data and to designate 1) the minimum cloud height, 2) the minimum cloud ceiling (a ceiling exists with BKN, OVC, or VV layers), and 3) the maximum obscuration level (using CLR if no clouds exist in that area):

# Helper function to pull out the minimum cloud height, limited to certain obscurations
getMinimumHeight <- function(df, 
                             byVars, 
                             types, 
                             baseLevel=0
                             ) {
    
    # Split the data in to the baseLevel and all other levels
    baseData <- df %>%
        filter(level==baseLevel)
    layerData <- df %>%
        filter(level!=baseLevel)
        
    # Take the layerData, limit to type in types, and find the minimum level
    layerData <- layerData %>%
        filter(type %in% types) %>%
        group_by_at(vars(all_of(byVars))) %>%
        filter(level==min(level)) %>%
        ungroup()
    
    # Put the data back together
    # Keep the maximum level for each set of byVars (will be 0 if no data for byVars in layerData)
    cloudData <- baseData %>%
        bind_rows(layerData) %>%
        arrange_at(vars(all_of(c(byVars, "level")))) %>%
        group_by_at(vars(all_of(byVars))) %>%
        filter(level==max(level))
    
}

# Extract the minimum cloud height, minimum ceiling height, and maximum obscuration
hgtCeilObsc <- function(df, 
                        byVars,
                        baseLevel=0
                        ) {

    # Function assumptions
    # Input data are unique by byVars-level and with columns 'height' and 'type'
    # For each byVars, a row with level=baseLevel, height=heightBase, type=typeBase has been created
    # Clouds increase in height with level
    # Clouds are non-decreasing in type (VV > OVC > BKN > SCT > FEW) with level
    
    # FUNCTION ARGUMENTS:
    # df: tibble or dataframe contiaining the clouds data
    # byVars: the variables that make up a unique observation (df should be unique by byVars-level)
    # baseLevel: the base level in df (by default, level 0)
    # heightBase: the height of the base level in df (by default, -100 feet)
    # typeBase: the type of obscuration at the base level in df (by default, CLR)
    
    # Get the maximum obscuration
    maxObsc <- df %>%
        group_by_at(vars(all_of(byVars))) %>% 
        filter(level==max(level)) %>%
        ungroup()
    
    # Get the minimum height (any type)
    minHeight <- getMinimumHeight(df, 
                                  byVars=byVars, 
                                  types=c("VV", "OVC", "BKN", "SCT", "FEW"), 
                                  baseLevel=baseLevel
                                  )
    
    # Get the minimum ceiling height (VV, OVC, BKN)
    minCeiling <- getMinimumHeight(df, 
                                   byVars=byVars, 
                                   types=c("VV", "OVC", "BKN"), 
                                   baseLevel=baseLevel
                                   )

    # Put the file together
    minCeiling <- minCeiling %>%
        rename(ceilingHeight=height, ceilingType=type, ceilingLevel=level)
    minHeight <- minHeight %>%
        rename(cloudHeight=height, cloudType=type, cloudLevel=level)
    maxObsc <- maxObsc %>%
        rename(obscHeight=height, obscType=type, obscLevel=level)
    
    # Merge
    cloudSummary <- maxObsc %>%
        full_join(minHeight, by=byVars) %>%
        full_join(minCeiling, by=byVars)
    
    cloudSummary
    
}

The function can then be run on the modified clouds data:

# Get the key clouds data
cloudSummary <- hgtCeilObsc(modCData, byVars=c("source", "sourceName", "dtime"))

# Check for consistency
cloudSummary %>%
    count(ceilingType, obscType)
## # A tibble: 7 x 3
##   ceilingType obscType     n
##   <fct>       <fct>    <int>
## 1 VV          VV         761
## 2 OVC         OVC      22654
## 3 BKN         OVC      10054
## 4 BKN         BKN      17198
## 5 CLR         SCT      12456
## 6 CLR         FEW      22711
## 7 CLR         CLR      46309
cloudSummary %>%
    count(cloudType, ceilingType)
## # A tibble: 10 x 3
##    cloudType ceilingType     n
##    <fct>     <fct>       <int>
##  1 VV        VV            761
##  2 OVC       OVC         17900
##  3 BKN       BKN         15436
##  4 SCT       OVC          1848
##  5 SCT       BKN          4936
##  6 SCT       CLR          8243
##  7 FEW       OVC          2906
##  8 FEW       BKN          6880
##  9 FEW       CLR         26924
## 10 CLR       CLR         46309
cloudSummary %>%
    count(cloudType, obscType)
## # A tibble: 12 x 3
##    cloudType obscType     n
##    <fct>     <fct>    <int>
##  1 VV        VV         761
##  2 OVC       OVC      17900
##  3 BKN       OVC       6384
##  4 BKN       BKN       9052
##  5 SCT       OVC       3444
##  6 SCT       BKN       3340
##  7 SCT       SCT       8243
##  8 FEW       OVC       4980
##  9 FEW       BKN       4806
## 10 FEW       SCT       4213
## 11 FEW       FEW      22711
## 12 CLR       CLR      46309

Plots for the maximum obscuration (through 12,000 feet) can then be created:

plotMaxObsc <- function(df, 
                        xVar, 
                        fillVar, 
                        title, 
                        subtitle="Up to and including 12,000 feet",
                        orderByVariable=NULL,
                        orderByValue=NULL,
                        posnBar="stack",
                        yLabel="# Hourly Observations",
                        legendLabel="",
                        facetOn=NULL
                        ) {

    # Get the levels to be used
    cLevels <- levels(df %>% pull(fillVar))
    
    # Create the main plot
    p1 <- df %>%
        ggplot(aes_string(fill=fillVar))
    if (!is.null(orderByVariable)) {
        p1 <- p1 + 
            geom_bar(aes(x=fct_reorder(get(xVar), get(orderByVariable)==orderByValue, .fun=sum)),
                     position=posnBar
                     )
    } else {
        p1 <- p1 + 
            geom_bar(aes_string(x=xVar), position=posnBar)
    }
    p1 <- p1 + 
        coord_flip() + 
        labs(x="", 
             y=yLabel, 
             title=title, 
             subtitle=subtitle
             ) + 
        theme(legend.position="bottom") + 
        scale_fill_discrete(legendLabel, rev(cLevels)) + 
        guides(fill=guide_legend(nrow=1))
    if (!is.null(facetOn)) {
        p1 <- p1 + facet_wrap(as.formula(paste("~", facetOn)))
    }
    print(p1)
    
}

# Cloud obscuration by source
plotMaxObsc(cloudSummary, 
            xVar="sourceName", 
            fillVar="obscType", 
            title="Maximum Cloud Obscuration", 
            orderByVariable="obscType",
            orderByValue="CLR"
            )

# Cloud obscuration by month
cloudSummary <- cloudSummary %>%
    mutate(month=lubridate::month(dtime), 
           hour=lubridate::hour(dtime), 
           monthfct=factor(month.abb[month], levels=month.abb[1:12])
           )
plotMaxObsc(cloudSummary, 
            xVar="monthfct", 
            fillVar="obscType", 
            title="Maximum Cloud Obscuration", 
            posnBar="fill"
            )

A few salient observations stand out about the maximum obscuration level:

  • Las Vegas is almost always clear or has just a few clouds up to 12,000 feet
  • Lincoln and Green Bay tend to be either clear up to 12,000 feet or to have an overcast
  • Traverse City is especially prone to being overcast and not having clear skied to 12,000 feet
  • Clear skies to 12,000 feet are roughly as likely in any month, but overcast by 12,000 feet is much more common in the cold season while FEW/SCT are more common during the warm season

Groups of cities can be examined, faceted by month:

cityCloudList <- list(c("klas_2016", "ksan_2016", "kiah_2016", "kmsy_2016"), 
                      c("kgrb_2016", "kgrr_2016", "kdtw_2016", "ktvc_2016"), 
                      c("klnk_2016", "kmsp_2016", "kmsn_2016", "kind_2016"), 
                      c("kmke_2016", "kord_2016", "kewr_2016")
                      )

for (x in cityCloudList) {
    cloudUse <- cloudSummary %>%
        filter(source %in% x)
    plotMaxObsc(cloudUse, 
                xVar="monthfct", 
                fillVar="obscType", 
                title="Maximum Cloud Obscuration", 
                facetOn="sourceName", 
                posnBar="fill"
                )
}

A few findings include:

  • While there may be some small seasonal patterns, observations in Las Vegas are almost always CLR/FEW
  • Houston and New Orleans both have few clear observations and few overcasts during June-September
  • San Diego has most of its overcasts during May-Septemver and most of its clear observations from October-April
  • Most of the remaining locales show a seasonal pattern with overcasts more common during the cold season
  • Newark is at a glance less seasonal than the midwestern locales

The ceiling heights can also be assessed:

cloudSummary <- cloudSummary %>%
    mutate(ceilFactor=factor(case_when(ceilingHeight == -100 ~ "None", 
                                       ceilingHeight <= 1000 ~ "0-1000", 
                                       ceilingHeight <= 3000 ~ "1000-3000", 
                                       ceilingHeight <= 6000 ~ "3000-6000",
                                       ceilingHeight <= 12000 ~ "6000-12000"
                                       ), 
                             levels=c("None", "6000-12000", "3000-6000", "1000-3000", "0-1000")
                             )
           )

plotMaxObsc(cloudSummary, 
            xVar="sourceName", 
            fillVar="ceilFactor", 
            title="Ceiling Height", 
            orderByVariable="ceilFactor",
            orderByValue="None"
            )

for (x in cityCloudList) {
    cloudUse <- cloudSummary %>%
        filter(source %in% x)
    plotMaxObsc(cloudUse, 
                xVar="monthfct", 
                fillVar="ceilFactor", 
                title="Ceiling Height (proportion of hourly observations)", 
                facetOn="sourceName", 
                posnBar="fill", 
                yLabel="", 
                legendLabel="Ceiling Height (feet)"
                )
}

Findings for ceiling height broadly line up with findings for maximum cloud obscuration, as expected:

  • Las Vegas is least likely to have a ceiling under 12,000 feet, followed by New Orleans, Lincoln, and Houston
  • San Diego is especially likely to have a ceiling of 1000-3000 feet relative to other locales, likely driven by marine layer

The same analysis can be run for the minimum cloud height:

cloudSummary <- cloudSummary %>%
    mutate(minCFactor=factor(case_when(cloudHeight == -100 ~ "None", 
                                       cloudHeight <= 1000 ~ "0-1000", 
                                       cloudHeight <= 3000 ~ "1000-3000", 
                                       cloudHeight <= 6000 ~ "3000-6000",
                                       cloudHeight <= 12000 ~ "6000-12000"
                                       ), 
                             levels=c("None", "6000-12000", "3000-6000", "1000-3000", "0-1000")
                             )
           )

plotMaxObsc(cloudSummary, 
            xVar="sourceName", 
            fillVar="minCFactor", 
            title="Minimum Cloud Height", 
            orderByVariable="minCFactor",
            orderByValue="None"
            )

for (x in cityCloudList) {
    cloudUse <- cloudSummary %>%
        filter(source %in% x)
    plotMaxObsc(cloudUse, 
                xVar="monthfct", 
                fillVar="minCFactor", 
                title="Minimum Cloud Height (proportion of hourly observations)", 
                facetOn="sourceName", 
                posnBar="fill", 
                yLabel="", 
                legendLabel="Minimum Cloud Height (feet)"
                )
}

At a glance, there seem to be some similarities and differences among the locales:

  • San Diego frequently has its lowest clouds in the 0-3000 feet range, particularly during the warm season
  • Las Vegas almost never has clouds below 6000 feet in any month
  • Houston and New Orleans look very similar in minimum cloud heights by month
  • Grand Rapids, Traverse City, Detroit, Indianapolis, Minneapolis, Chicago, Milwaukee, and Madison look similar in minimum cloud heights
  • Green Bay and Lincoln appear to have more occurences of clear skies up to 12000 feet than the other cold weather cities
  • Newark seems somewhat close to several different groupings of cities

A more analytical approach would look at the distance between the various locales. Broadly speaking, distance can be calculated based on counts by month by locale. No scaling is performed since each location has roughly the same number of observations by month, and the intent is to find macro similarities (e.g., if 1 city has 2% data in bucket x and the others all have 0% data in bucket x, this analysis would treat that as a minor difference where with scaled data it would be a primary difference driver).

There are three main variable to consider for distance:

  • obscType - the obscuration type observed
  • minCFactor - minimum cloud height observed, bucketed as 0-1000, 1000-3000, 3000-6000, 6000-12000, None
  • ceilFactor - ceiling height observed, bucketed as 0-1000, 1000-3000, 3000-6000, 6000-12000, None

Distances can be calculated using any or all of these variables. A function is created to that process can be repeated:

findCloudDist <- function(df, 
                          byVar, 
                          fctVar,
                          pivotVar="monthfct", 
                          scaleDistData=FALSE, 
                          returnPivotOnly=FALSE
                          ) {
    
    # FUNCTION ARGUMENTS
    # df: data frame or tibble containing one record per locale and time
    # byVar: the variable(s) by which the final distance file should be unique
    # fctVar: the factor variables to be counted up, with the sums being the distance inputs
    # pivotVar: the variable(s) which should be pivoted in to columns to make the file unique by byVar
    # scale: whether to scale the data prior to calculating distances
    # returnDistMatrix: whether to just return the pivoted data frame (default, FALSE, returns the distance matrix calculated from this pivoted data frame rather than the data frame itself)
    
    # Create the counts data by byVar, pivoted by fctVar and pivotVar
    baseData <- df %>%
        group_by_at(vars(all_of(c(byVar, fctVar, pivotVar)))) %>%
        summarize(n=n()) %>%
        ungroup() %>%
        pivot_wider(names_from=all_of(c(fctVar, pivotVar)), values_from="n") %>%
        mutate_if(is.numeric, tidyr::replace_na, replace=0)
    
    # If the data are only yo be pivoted, return and exit the function
    if (returnPivotOnly) {
        return(baseData)
    }
    
    # Split in to descriptors and data
    descData <- baseData %>%
        select_at(vars(all_of(byVar))) %>%
        mutate(rowN=row_number())
    distData <- baseData %>%
        select_at(vars(-any_of(byVar))) %>%
        as.matrix()
    
    # Scale distdata if requested
    if (scaleDistData) {
        distData <- scale(distData)
    }
    
    # Create the distances, convert back to data frame, pivot_longer, and attach the labels
    dist(distData) %>%
        as.matrix() %>%
        as_tibble() %>%
        mutate(row1=row_number()) %>%
        pivot_longer(-row1, names_to="row2", values_to="dist") %>%
        mutate(row2=as.integer(row2)) %>%
        inner_join(descData, by=c("row1"="rowN")) %>%
        rename_at(vars(all_of(byVar)), ~paste0(., "_1")) %>%
        inner_join(descData, by=c("row2"="rowN")) %>%
        rename_at(vars(all_of(byVar)), ~paste0(., "_2"))
    
}

The function is then run for the obscuration data as an example:

# Find the distances for obscType by locale vs. locale
obscDist <- findCloudDist(cloudSummary, 
                          byVar=c("source", "sourceName"), 
                          fctVar="obscType", 
                          pivotVar="monthfct"
                          )
obscDist
## # A tibble: 225 x 7
##     row1  row2  dist source_1  sourceName_1       source_2 sourceName_2         
##    <int> <int> <dbl> <chr>     <chr>              <chr>    <chr>                
##  1     1     1    0  kdtw_2016 Detroit, MI (2016) kdtw_20~ Detroit, MI (2016)   
##  2     1     2  627. kdtw_2016 Detroit, MI (2016) kewr_20~ Newark, NJ (2016)    
##  3     1     3  669. kdtw_2016 Detroit, MI (2016) kgrb_20~ Green Bay, WI (2016) 
##  4     1     4  294. kdtw_2016 Detroit, MI (2016) kgrr_20~ Grand Rapids, MI (20~
##  5     1     5  726. kdtw_2016 Detroit, MI (2016) kiah_20~ Houston, TX (2016)   
##  6     1     6  347. kdtw_2016 Detroit, MI (2016) kind_20~ Indianapolis, IN (20~
##  7     1     7 1421. kdtw_2016 Detroit, MI (2016) klas_20~ Las Vegas, NV (2016) 
##  8     1     8 1092. kdtw_2016 Detroit, MI (2016) klnk_20~ Lincoln, NE (2016)   
##  9     1     9  361. kdtw_2016 Detroit, MI (2016) kmke_20~ Milwaukee, WI (2016) 
## 10     1    10  345. kdtw_2016 Detroit, MI (2016) kmsn_20~ Madison, WI (2016)   
## # ... with 215 more rows

A function can then be created to plot the distances as a heamap and to return the minimum distance for each locale:

plotCloudDist <- function(df, 
                          var1="sourceName_1", 
                          var2="sourceName_2", 
                          met="dist", 
                          roundDist=0, 
                          title="Distance Between Locales", 
                          subT=""
                          ) {
    
    # FUNCTION ARGUMENTS:
    # df: tibble or data frame containing distance data
    # var1: the variable containing the first locale
    # var2: the variable containing the second locale
    # dist: the variable containing the pre-calculated distance between var1 and var2
    # roundDist: the rounding for the distance in the plot

    # Process the data frame and exclude any occurences of distance to self    
    distData <- df %>%
        select_at(vars(all_of(c(var1, var2, met)))) %>%
        filter(get(var1) != get(var2))
    
    # Get the locales by minimum distance to any other locale
    distHiLo <- distData %>%
        group_by_at(vars(all_of(var1))) %>%
        filter(dist==min(dist)) %>%
        arrange(dist) %>%
        pull(var1)
    
    # Create a heatmap of the distances
    distData %>%
        ggplot(aes(x=factor(get(var1), levels=distHiLo), y=factor(get(var2), levels=distHiLo))) + 
        geom_tile(aes(fill=dist)) + 
        geom_text(aes(label=round(dist, roundDist)), color="lightblue") +
        labs(x="", y="", title=title, subtitle=subT) + 
        scale_fill_gradient("Distance", low="black", high="white") + 
        theme(axis.text.x=element_text(angle=90))
    
}

The process can then be run on the obscuration data as a check:

plotCloudDist(obscDist, subT="Based on % of obscuration type by month")

FINDINGS FOR OBSCURATION DISTANCE:

  • Las Vegas is far from everything, with its only modestly near neighbor being Lincoln
  • Lincoln is far from everything except for its moderately near neighbors Las Vegas and Green Bay
  • San Diego is far from everything except its moderately near neighbors Houston, New Orleans, and Newark
  • Green Bay is closest to Minneapolis and is moderately close to most of the midwest cities
  • Newark is reasonably close to many cities, being distant only from Traverse City, Green Bay, Lincoln, and Las Vegas
  • Houston and New Orleans are closest to each other, then to Newark
  • Traverse City and Grand Rapids are relatively close to each other
  • The remaining cold weather cities (Milwaukee, Chicago, Madison, Minneapolis, Indianapolis, Detroit) are all relatively close to each other and also to Grand Rapids

This is suggestive that there is a cold-weather cluster, a hot and humid cluster, and then several locales that are more or less standalone but could be grouped loosely to each other.

The functions can then be run on variations of the data:

# Run for minCFactor
findCloudDist(cloudSummary, 
              byVar=c("source", "sourceName"), 
              fctVar="minCFactor", 
              pivotVar="monthfct"
              ) %>%
    plotCloudDist(subT="Based on % in each minimum cloud height bucket by month")

# Run for ceilFactor
findCloudDist(cloudSummary, 
              byVar=c("source", "sourceName"), 
              fctVar="ceilFactor", 
              pivotVar="monthfct"
              ) %>%
    plotCloudDist(subT="Based on % in each ceiling height bucket by month")

Findings by minimum cloud height are similar to findings by obscuration.

Findings by ceiling height show that Lincoln and Newark are close to each other and that San Diego and Las Vegas are both segments of one. Traverse City and Grand Rapids are close to each other; as are New Orleans and Houston. There continues to be a large, close, segment of cold cities.

Of interest is whether a simple kmeans analysis returns the same findings:

set.seed(2006040940)

ceilDistData <- findCloudDist(cloudSummary, 
                              byVar=c("source", "sourceName"), 
                              fctVar="ceilFactor", 
                              pivotVar="monthfct", 
                              returnPivotOnly=TRUE
                              )

tibble::tibble(locale=ceilDistData$sourceName, 
               cluster=kmeans(dist(ceilDistData[3:ncol(ceilDistData)]), centers=5, nstart=1000)$cluster
               ) %>%
    arrange(-cluster)
## # A tibble: 15 x 2
##    locale                   cluster
##    <chr>                      <int>
##  1 Grand Rapids, MI (2016)        5
##  2 Traverse City, MI (2016)       5
##  3 Las Vegas, NV (2016)           4
##  4 Newark, NJ (2016)              3
##  5 Houston, TX (2016)             3
##  6 Lincoln, NE (2016)             3
##  7 New Orleans, LA (2016)         3
##  8 Detroit, MI (2016)             2
##  9 Green Bay, WI (2016)           2
## 10 Indianapolis, IN (2016)        2
## 11 Milwaukee, WI (2016)           2
## 12 Madison, WI (2016)             2
## 13 Minneapolis, MN (2016)         2
## 14 Chicago, IL (2016)             2
## 15 San Diego, CA (2016)           1

The five-segment k-means is consistent - solo segments for San Diego and Las Vegas; a Grand Rapids-Traverse City segment; a cold-weather midwestern segment (Grand Rapids and Traverse City are on the downwind side of Lake Michigan and would be expected to see much different clouds than cities with similar temperatures); and a catch-all segment.

The data can also be assessed using hierarchical clustering:

hclust(dist(ceilDistData[3:ncol(ceilDistData)]), method="complete") %>%
    plot(labels=ceilDistData$sourceName, cex=0.5, main="Hierarchical on Ceiling Height: method=complete")

hclust(dist(ceilDistData[3:ncol(ceilDistData)]), method="single") %>%
    plot(labels=ceilDistData$sourceName, cex=0.5, main="Hierarchical on Ceiling Height: method=single")

Similar conclusions can be drawn from hierarchical clustering based on ceiling height buckets by month:

  • San Diego and Las Vegas are each segments of one
  • Houston and New Orleans tend to associate together
  • Grand Rapids and Traverse City tend to associate together
  • Lincoln and Newark tend to associate together
  • There are possibly two sub-groupings of the other cold weather locales

This is suggestive of either 6 or 7 segments in this data. Does k-means match up?

tibble::tibble(locale=ceilDistData$sourceName, 
               cluster=kmeans(dist(ceilDistData[3:ncol(ceilDistData)]), centers=6, nstart=1000)$cluster
               ) %>%
    arrange(-cluster)
## # A tibble: 15 x 2
##    locale                   cluster
##    <chr>                      <int>
##  1 Las Vegas, NV (2016)           6
##  2 San Diego, CA (2016)           5
##  3 Newark, NJ (2016)              4
##  4 Lincoln, NE (2016)             4
##  5 Houston, TX (2016)             3
##  6 New Orleans, LA (2016)         3
##  7 Grand Rapids, MI (2016)        2
##  8 Traverse City, MI (2016)       2
##  9 Detroit, MI (2016)             1
## 10 Green Bay, WI (2016)           1
## 11 Indianapolis, IN (2016)        1
## 12 Milwaukee, WI (2016)           1
## 13 Madison, WI (2016)             1
## 14 Minneapolis, MN (2016)         1
## 15 Chicago, IL (2016)             1
tibble::tibble(locale=ceilDistData$sourceName, 
               cluster=kmeans(dist(ceilDistData[3:ncol(ceilDistData)]), centers=7, nstart=1000)$cluster
               ) %>%
    arrange(-cluster)
## # A tibble: 15 x 2
##    locale                   cluster
##    <chr>                      <int>
##  1 Houston, TX (2016)             7
##  2 New Orleans, LA (2016)         7
##  3 Detroit, MI (2016)             6
##  4 Green Bay, WI (2016)           6
##  5 Minneapolis, MN (2016)         6
##  6 Newark, NJ (2016)              5
##  7 Lincoln, NE (2016)             5
##  8 Grand Rapids, MI (2016)        4
##  9 Traverse City, MI (2016)       4
## 10 Indianapolis, IN (2016)        3
## 11 Milwaukee, WI (2016)           3
## 12 Madison, WI (2016)             3
## 13 Chicago, IL (2016)             3
## 14 San Diego, CA (2016)           2
## 15 Las Vegas, NV (2016)           1

Very nicely aligned. The addition of the sixth segment splits the catch-all segment (from the previous 5 cluster analysis) in to a Houston-New Orleans segment and a Lincoln-Newark segment. The addition of the seventh segment splits the cold weather cities in to Chicago-Madison-Milwukee-Indianapolis; and Detroit-Green Bay-Minneapolis.

Are any differences observed in trends for minimum cloud height?

heightDistData <- findCloudDist(cloudSummary, 
                                byVar=c("source", "sourceName"), 
                                fctVar="minCFactor", 
                                pivotVar="monthfct", 
                                returnPivotOnly=TRUE
                                )

hclust(dist(heightDistData[3:ncol(heightDistData)]), method="complete") %>%
    plot(labels=heightDistData$sourceName, cex=0.5, 
         main="Hierarchical on Minimum Cloud Height: method=complete"
         )

hclust(dist(heightDistData[3:ncol(heightDistData)]), method="single") %>%
    plot(labels=heightDistData$sourceName, cex=0.5, 
         main="Hierarchical on Minimum Cloud Height: method=single"
         )

There are some differences when using the ‘complete’ linkage method:

  • There is a Green Bay-Lincoln cluster, with Las Vegas merging in relatively soon
  • There are arguably two cold weather clusters (including Grand Rapids and Traverse City), with Newark merging in relatively soon
  • There is a Houston-New Orleans segment, with San Diego merging in relatively soon

When using the ‘single’ linkage method, these patterns become muted:

  • There are still arguably two cold-weather clusters
  • Houston and New Orleans form a cluster
  • San Diego and Las Vegas are their own cluster
  • Lincoln, Newark, and Green Bay all fall somewhere in the middle

We would expect k-means to pull out 1-2 cold-weather clusters and a Houston-New Orleans cluster:

set.seed(2006041003)

tibble::tibble(locale=heightDistData$sourceName, 
               cluster=kmeans(dist(heightDistData[3:ncol(heightDistData)]), centers=5, nstart=1000)$cluster
               ) %>%
    arrange(-cluster)
## # A tibble: 15 x 2
##    locale                   cluster
##    <chr>                      <int>
##  1 Las Vegas, NV (2016)           5
##  2 Lincoln, NE (2016)             5
##  3 Houston, TX (2016)             4
##  4 New Orleans, LA (2016)         4
##  5 San Diego, CA (2016)           3
##  6 Detroit, MI (2016)             2
##  7 Grand Rapids, MI (2016)        2
##  8 Indianapolis, IN (2016)        2
##  9 Milwaukee, WI (2016)           2
## 10 Madison, WI (2016)             2
## 11 Minneapolis, MN (2016)         2
## 12 Chicago, IL (2016)             2
## 13 Traverse City, MI (2016)       2
## 14 Newark, NJ (2016)              1
## 15 Green Bay, WI (2016)           1
tibble::tibble(locale=heightDistData$sourceName, 
               cluster=kmeans(dist(heightDistData[3:ncol(heightDistData)]), centers=6, nstart=1000)$cluster
               ) %>%
    arrange(-cluster)
## # A tibble: 15 x 2
##    locale                   cluster
##    <chr>                      <int>
##  1 Houston, TX (2016)             6
##  2 New Orleans, LA (2016)         6
##  3 Newark, NJ (2016)              5
##  4 Green Bay, WI (2016)           5
##  5 San Diego, CA (2016)           4
##  6 Detroit, MI (2016)             3
##  7 Grand Rapids, MI (2016)        3
##  8 Indianapolis, IN (2016)        3
##  9 Milwaukee, WI (2016)           3
## 10 Madison, WI (2016)             3
## 11 Minneapolis, MN (2016)         3
## 12 Chicago, IL (2016)             3
## 13 Traverse City, MI (2016)       3
## 14 Lincoln, NE (2016)             2
## 15 Las Vegas, NV (2016)           1
tibble::tibble(locale=heightDistData$sourceName, 
               cluster=kmeans(dist(heightDistData[3:ncol(heightDistData)]), centers=7, nstart=1000)$cluster
               ) %>%
    arrange(-cluster)
## # A tibble: 15 x 2
##    locale                   cluster
##    <chr>                      <int>
##  1 Indianapolis, IN (2016)        7
##  2 Milwaukee, WI (2016)           7
##  3 Madison, WI (2016)             7
##  4 Minneapolis, MN (2016)         7
##  5 Chicago, IL (2016)             7
##  6 Lincoln, NE (2016)             6
##  7 Newark, NJ (2016)              5
##  8 Green Bay, WI (2016)           5
##  9 Las Vegas, NV (2016)           4
## 10 Detroit, MI (2016)             3
## 11 Grand Rapids, MI (2016)        3
## 12 Traverse City, MI (2016)       3
## 13 Houston, TX (2016)             2
## 14 New Orleans, LA (2016)         2
## 15 San Diego, CA (2016)           1
tibble::tibble(locale=heightDistData$sourceName, 
               cluster=kmeans(dist(heightDistData[3:ncol(heightDistData)]), centers=8, nstart=1000)$cluster
               ) %>%
    arrange(-cluster)
## # A tibble: 15 x 2
##    locale                   cluster
##    <chr>                      <int>
##  1 San Diego, CA (2016)           8
##  2 Indianapolis, IN (2016)        7
##  3 Milwaukee, WI (2016)           7
##  4 Madison, WI (2016)             7
##  5 Minneapolis, MN (2016)         7
##  6 Chicago, IL (2016)             7
##  7 Houston, TX (2016)             6
##  8 New Orleans, LA (2016)         6
##  9 Las Vegas, NV (2016)           5
## 10 Detroit, MI (2016)             4
## 11 Grand Rapids, MI (2016)        4
## 12 Traverse City, MI (2016)       4
## 13 Newark, NJ (2016)              3
## 14 Green Bay, WI (2016)           2
## 15 Lincoln, NE (2016)             1

With five segments, there is:

  • San Diego
  • Houston and New Orleans
  • Newark and Green Bay
  • Las Vegas and Lincoln
  • Cold weather cities

Adding the sixth cluster splits apart Las Vegas and Lincoln. Adding the seventh cluster cleaves Grand Rapids-Traverse City-Detroit from the other cold weather cities. Adding the eighth cluster splits apart Green Bay and Newark.

Precipitation

METAR data include descriptions of the precipitation occuring at any given time. Two of the most common precipitation forms are rain (RA) and SN(). These can occur together, denoted as RASN or SNRA in the METAR.

Further, the precipitation type can be classified using a prefix as light (-), moderate (no prefix), or heavy (+). So, RA would be moderate rain, -SNRA would be a light snow-rain mix, +RA would be heavy rain.

Additionally, the timing of the precipitation event is captured in the remarks using B (begin) and E (end). So, an hourly METAR of RAB20E35B50 would mean rain started at 20 past the hour, ended at 35 past the hour, and began again at 50 past the hour. Since METAR are often taken just before the top of the hour, a four-digit time is used if it is in the ‘previous’ hour; for example, RAB1959E36 in the 2053Z METAR.

Broadly speaking, precipitation at a specific point in time can be extracted from the main METAR (preceding RMK) whil prcipitation intervals can be extracted by parsing the remarks (following RMK). There can be misalignment between these; for example, an interval that suggests rain is occuring at a time the METAR does not show rain.

There are two key functions for the extraction process:

  • suggestBeginEndTimes() - a function that compares the METAR and remarks and suggests precipitation begin timea that should be added or excluded, and precipitation end times that should be added or excluded
  • runFullPrecipExtraction() - a function that extracts relevant precipitation information, with options for begin/end times to be added or excluded

Helper function include:

  • extractPrecipData() - extracts the precipitation begin and end times from a processed METAR file
  • getBeginendTimeMatrix() - get the beginning and ending times as a matrix
  • getBeginEndTimeVector() - extract the beginning or ending times from a column of the matrix
  • extractTime() - extract the time as a zero-padded number from a named vector
  • createPrecipInterval() - create the precipitation time intervals, and flag any that are negative or very long
  • createPrecipIntervalPlots() - distributions of precipitation events, precipitation by day, and precipitation by month

Example code includes:

# Extract the precipitation begin and end times from a processed METAR file
extractPrecipData <- function(df, 
                              pType, 
                              showRegex=TRUE, 
                              showSTR=FALSE, 
                              showTable=FALSE,
                              dupError=TRUE
                              ) {

    # FUNCTION ARGUMENTS:
    # df: the data frame or tibble
    # pType: the regex code to be used
    # showRegex: boolean, whether to print out the regex code used
    # showSTR: boolean, whether to show the file str()
    # showTable: boolean, whether to show the file counts of start and end times
    # dupError: boolean, if TRUE will throw an error if any duplicates are found; if FALSE just notes it
    
    # Extract the METAR and the date-time from a processed list
    procMET <- df %>% 
        select(origMETAR, dtime)

    # Check whether there are comments for the desired pType either beginning or ending
    keyPattern <- paste0("(", pType, "[B|E]\\d+[0-9BE]*)")
    if (showRegex) { cat("\nRegex search code is:", keyPattern, "\n\n") }
    
    procMET <- procMET %>% 
        mutate(precipData=str_extract(origMETAR, pattern=keyPattern), 
               isPrecip=grepl(paste0(pType, "[B|E]"), origMETAR, perl=TRUE), 
               nBegin=pmax(0, str_count(precipData, "B"), na.rm=TRUE), 
               nEnd=pmax(0, str_count(precipData, "E"), na.rm=TRUE), 
               dateUTC=lubridate::date(dtime), 
               hourUTC=lubridate::hour(dtime)
               )
    if (showSTR) {
        str(procMET)
        cat("\n")
    }

    # Check the counts of precipitation beginning and rain ending
    if (showTable) {
        procMET %>%
            count(isPrecip, nBegin, nEnd) %>%
            print()
    }

    # Check that the file is unique by time
    dupAns <- procMET %>%
        select(dateUTC, hourUTC) %>%
        duplicated %>%
        any()

    if (dupError) {
        if (dupAns) { stop("Duplicates observed, investigate and fix\n") }
    } else {
        cat("\nAre there any problems with duplicated keys?", dupAns, "\n\n")
    }
    
    procMET
    
}


# Helper function to extract the beginning and ending times using str_extract_all
getBeginEndTimeMatrix <- function(file, 
                                  pullVar="precipData", 
                                  pState="B"
                                  ) {
    
    # FUNCTION ARGUMENTS:
    # file: file containing data
    # pullVar: the variable of interest in file
    # pState: the precipitation state of interest - B (begin) or E (end)
    
    file %>%
        pull(pullVar) %>%
        str_extract_all(paste0(pState, "\\d+"), simplify=TRUE) %>%
        as.data.frame(stringsAsFactors=FALSE)
    
}


getBeginEndTimeVector <- function(timeExtractFile, 
                                  origFullFile, 
                                  extractVar, 
                                  extractSym
                                  ) {
    
    # FUNCTION ARGUMENTS:
    # timeExtractFile: the file containing the extracted begin and end times
    # origFullFile: the file that the extraction was run from (includes date and time)
    # extractVar: the variable name for the column to be extracted from
    # extractSym: whether it is B (begin) or E (end)
    
    timeExtractFile %>%
        cbind(origFullFile[, c("dateUTC", "hourUTC")]) %>%
        mutate(hourChar=str_pad(str_trim(as.character(hourUTC)), width=2, pad="0")) %>%
        apply(1, FUN=extractTime, var=extractVar, sym=extractSym)
    
}


# Convert the rain begins data to the hour and minute associated to the UTC
extractTime <- function(x, 
                        var, 
                        sym="B"
                        ) {
    
    # FUNCTION ARGUMENTS:
    # x: the named vector
    # var: the variable name to extract
    # sym: whether the suppression needs to be on B (begin) or E (end) to convert to a zero-padded number
    
    # Extracting the time is tricky
    # If NA or blank, then the time is NA
    # Replace the leading B or E and there should then be a 2-digit or 4-digit number
    # If 2-digit, then append date amd hour in front of it
    # If 4-digit, then append date in front of it
    # Exception: if 4-digit and begins with 23, then need to use the previous day
    
    if (is.na(x[var]) | x[var]=="") {
        utcReturn <- NA
    }
    else {
        utcUse <- str_replace(x[var], sym, "")
        if (str_length(utcUse)==4) {
            utcReturn <- paste0(x["dateUTC"], " ", utcUse)
            # If a 4-digit time starts with 23 and is in the 0Z METAR, it is part of the previous day
            if(str_sub(utcUse, 1, 2)=="23" & as.numeric(x["hourChar"])==0) {
                utcReturn <- paste0(as.Date(x["dateUTC"]) - 1, " ", utcUse)
            }
        } else if (str_length(utcUse)==2) {
            utcReturn <- paste0(x["dateUTC"], " ", x["hourChar"], utcUse)
        } else {
            cat("\nCannot parse data: ", x, "\n", x[var], "\n", var, sym, utcUse)
            stop()
        }
    }
    utcReturn

}


suggestBeginEndTimes <- function(df, 
                                 regMatch, 
                                 nColMax=3, 
                                 printNACols=FALSE
                                 ) {

    # FUNCTION ARGUMENTS
    # df: the data frame to be examined
    # regMatch: the relevant regex extraction code
    # nColMax: the maximum number of columns allowed for precipitation begins and precipitation ends
    # printNACols: boolean, whether to print the NA column sums
    
    # Pull the data and check for the specified precipitation pattern and lags
    # Needs to follow the \\d{6}Z that shows the time and precede the RMK that denotes the remarks
    sugStates <- df %>%
        select(dtime, origMETAR) %>%
        mutate(curPrecip=str_detect(origMETAR, paste0("\\d{6}Z.*", regMatch, ".*RMK")), 
               lagPrecip=lag(curPrecip, 1)
               )

    # Use the analysis data to look for begins and ends flagged in the remarks
    sugBE <- extractPrecipData(sugStates, pType=regMatch)

    # Inner join the data by dtime
    sugBEJoin <- sugBE %>% 
        select(dtime, precipData, chgPrecip=isPrecip, dateUTC, hourUTC)
    sugStates <- sugStates %>%
        inner_join(sugBEJoin, by="dtime")

    # Get the beginning and end times data for the desired precipitation type
    sugBegin <- getBeginEndTimeMatrix(sugStates, pState="B")
    sugEnd <- getBeginEndTimeMatrix(sugStates, pState="E")

    # Ensure that user-defined maximum number of columns is not exceeded
    if (ncol(sugBegin) > nColMax | ncol(sugEnd) > nColMax) { 
        stop("Function argument set to allow only for 0 - ", nColMax, "columns, investigate and fix")
    }
    
    # For each iteration, create testBTn and testETn (use NA if no column)
    for (ctr in 1:nColMax) {
        
        # Create NA as the baseline
        assign(paste0("testBT", ctr), NA)
        assign(paste0("testET", ctr), NA)
        
        # Assign begin if exists
        if (ncol(sugBegin) >= ctr) {
            bt <- getBeginEndTimeVector(sugBegin, sugStates, extractVar=paste0("V", ctr), extractSym="B")
            assign(paste0("testBT", ctr), bt)
        }
        
        # Assign end if exists
        if (ncol(sugEnd) >= ctr) {
            et <- getBeginEndTimeVector(sugEnd, sugStates, extractVar=paste0("V", ctr), extractSym="E")
            assign(paste0("testET", ctr), et)
        }
        
    }
    
    # Integrate to a single file, allowing for nMaxCol applications of variables
    # Create file sugExceptions as sugStates and initialize the columns for begins and ends
    sugExceptions <- sugStates %>%
        mutate(begins=0, ends=0)
    
    # Loop through up to nColMax, adding the appropriate bn=testBTn and en=testETn
    for (ctr in 1:nColMax) {
        sugExceptions <- sugExceptions %>%
            mutate(!!paste0("b", ctr) := get(paste0("testBT", ctr)), 
                   !!paste0("e", ctr) := get(paste0("testET", ctr)), 
                   begins=begins + ifelse(is.na(get(paste0("b", ctr))), 0, 1),
                   ends=ends + ifelse(is.na(get(paste0("e", ctr))), 0, 1)
                   )
    }
    
    # Create the final version by checking key criteria
    # Add a criteria for b before e if previous; and e before b if not previous?
    sugExceptions <- sugExceptions %>%
        mutate(etob=begins > ends, 
               btoe=ends > begins, 
               needBegin=curPrecip & !lagPrecip & !etob, 
               needEnd=!curPrecip & lagPrecip & !btoe, 
               overBegin=etob & (lagPrecip | !curPrecip), 
               overEnd=btoe & (curPrecip | !lagPrecip)
               )
    
    # Report on the NA status by column
    if (printNACols) {
        colSums(is.na(sugExceptions)) %>% 
            print()
    }

    # Flag potential issues
    cat("\nNeed Begin time\n")
    sugExceptions %>%
        filter(needBegin) %>%
        select(dtime, origMETAR) %>%
        print()

    cat("\nNeed End time\n")
    sugExceptions %>%
        filter(needEnd) %>%
        select(dtime, origMETAR) %>%
        print()

    cat("\nExtraneous Begin time\n")
    sugExceptions %>%
        filter(overBegin) %>%
        select(dtime, paste0("b", 1:nColMax)) %>%
        print()

    cat("\nExtraneous End time\n")
    sugExceptions %>%
        filter(overEnd) %>%
        select(dtime, paste0("e", 1:nColMax)) %>%
        print()

    cat("\nWrong amount of begins or ends\n")
    sugExceptions %>%
        mutate(absMatch=abs(begins-ends), absLag=lag(absMatch, 1), absLead=lead(absMatch, 1)) %>%
        filter(pmax(absMatch, absLag, absLead) > 1) %>%
        select(dtime, paste0("e", 1:nColMax), paste0("b", 1:nColMax)) %>%
        print()
    
    sugExceptions %>%
        filter(needBegin | needEnd | overBegin | overEnd | abs(begins - ends) > 1)
    
}

A function is then created to run the full precipitation extraction:

# Function to create the time intervals data
createPrecipInterval <- function(endVector, 
                                 beginVector, 
                                 endExclude=c(), 
                                 beginExclude=c(), 
                                 sState=FALSE, 
                                 nMinPrint=1, 
                                 maxProb=1000, 
                                 nMaxPrint=1
                                 ) {
    
    # FUNCTION ARGUMENTS:
    # endVector: vector of end times
    # beginVector: vector of begin times
    # endExclude: vector of end times to exclude
    # beginExclude: vector of begin times to exclude
    # sState: the starting precipitation state (TRUE for precipitation., FALSE for no precipitation)
    # nMinPrint: index for the first position of negative intervals to show (typically 1)
    # maxProb: maximum duration (minutes) of precipitation to flag a print out
    # nMaxPrint: index for the first position of very long intervals to show (typically 1)
    
    # If the starting state is one of precipitation, allow it to 'burn in' by deleting the first end time
    if(sState) { endVector <- endVector[2:length(endVector)] }
    
    # Create the interval data
    endsUse <- endVector[!(endVector %in% lubridate::ymd_hm(endExclude))]
    beginsUse <- beginVector[!(beginVector %in% lubridate::ymd_hm(beginExclude))]
    intervalData <- endsUse - beginsUse
    
    # Show a summary of the interval data
    print(summary(as.numeric(intervalData)))
    
    # If there are any non-positive intervals, print the data causing the first of them
    if (min(as.numeric(intervalData)) <= 0) {
        cat("\nProblem Detected - Intervals are not positive.  Data to help investigate\n")
        posns <- which(as.numeric(intervalData) <= 0)
        posns <- max(1, posns[1]-5):min(length(beginsUse), length(endsUse), posns[nMinPrint]+5)
        cat("\nVector of Begins\n")
        print(lubridate::as_datetime(beginsUse[posns]))
        cat("\nVector of Ends\n")
        print(lubridate::as_datetime(endsUse[posns]))
        cat("\n")
    }
    
    # If there are any very long positive intervals, print the data causing the first five of them
    if (max(as.numeric(intervalData)) >= maxProb) {
        cat("\nPotential problem Detected - very long.  Data to help investigate\n")
        posns <- which(as.numeric(intervalData) >= maxProb)
        cat("\nPositions with problems are:", posns)
        posns <- max(1, posns[1]-5):min(length(beginsUse), length(endsUse), posns[min(length(posns), nMaxPrint)]+5)
        cat("\nVector of Begins\n")
        print(lubridate::as_datetime(beginsUse[posns]))
        cat("\nVector of Ends\n")
        print(lubridate::as_datetime(endsUse[posns]))
        cat("\n")
    }
    
    # Return the interval data
    intervalData
    
}



createPrecipIntervalPlots <- function(intervalData, 
                                      beginsData, 
                                      titleText, 
                                      yAxisText, 
                                      beginExclude=c(), 
                                      returnPlotsAndData=FALSE
                                      ) {

    # FUNCTION ARGUMENTS:
    # intervalData: the precipitation interval data
    # beginsData: the precipitation begin times
    # titleText: plot title
    # yAxisText: plot y-axis
    # beginExclude: begin times to exclude
    # returnPlotsAndData: whether to return the plot objects
    
    # Exclude any data from begins as needed
    beginsData <- beginsData[!(beginsData %in% lubridate::ymd_hm(beginExclude))]
    
    # Create a plotting data frame
    histFrame <- data.frame(minutesPrecip=as.numeric(intervalData), 
                            month=lubridate::month(beginsData), 
                            rainDate=lubridate::date(beginsData)
                            ) %>%
        mutate(hoursPrecip=minutesPrecip/60)
    
    # Plot the precipitation durations in hours
    p1 <- histFrame %>%
        ggplot(aes(x=hoursPrecip)) +
        geom_histogram() + 
        labs(title=titleText, x=yAxisText, 
             subtitle="Distribution of hours per unique precipitation event"
             )
    print(p1)

    # Plot the precipitation by day in hours
    p2 <- histFrame %>%
        group_by(rainDate) %>%
        summarize(hoursPrecip=sum(hoursPrecip)) %>%
        ggplot(aes(x=hoursPrecip)) +
        geom_histogram() + 
        labs(title=titleText, x=yAxisText, 
             subtitle="Distribution of hours per day of precipitation (on days when 1+ minutes occurred)"
             )
    print(p2)
    
    # Plot the rain totals (in hours) by month
    # Create a data frame of all months and merge in precipitation data as an where available (0 otherwise)
    monthFrame <- histFrame %>%
        group_by(month) %>%
        summarize(minutesPrecip=sum(minutesPrecip), hoursPrecip=sum(hoursPrecip), nPrecip=n()) %>%
        right_join(data.frame(month=1:12, monthName=month.abb[1:12]), by="month") %>%
        tidyr::replace_na(list(minutesPrecip=0, hoursPrecip=0, nPrecip=0))
    # print(monthFrame)
    
    p3 <- monthFrame %>%
        ggplot(aes(x=factor(monthName, levels=month.abb[1:12]), y=hoursPrecip)) +
        geom_col() + 
        labs(title=titleText, y=yAxisText, x="") + 
        geom_text(aes(y=2 + hoursPrecip, label=round(hoursPrecip, 1)))
    print(p3)
    
    if (returnPlotsAndData) {
        list(histFrame=histFrame, monthFrame=monthFrame, p1=p1, p2=p2, p3=p3)
    } else {
        NULL
    }
}



# Precipitation regex to name mapper
precipMapper <- c("(?<!BL)SN"="Snowfall", 
                  "(?<!FZ)RA"="Rainfall", 
                  "(?<!VC)TS"="Thunderstorm"
                  )



# Combining all the functions in one place
runFullPrecipExtraction <- function(df, 
                                    pType, 
                                    titleText=NULL, 
                                    yAxisText=NULL,
                                    endExclude=c(), 
                                    beginExclude=c(), 
                                    endAdd=c(),
                                    beginAdd=c(),
                                    maxProb=1000, 
                                    sState=FALSE, 
                                    makePlots=TRUE, 
                                    returnPlotsAndData=FALSE, 
                                    axisMapper=precipMapper,
                                    titleMapper=cityNameMapper, 
                                    nColMax=3, 
                                    frameName=deparse(substitute(df))
                                    ) {
    
    # FUNCTION ARGUMENTS
    # df: the data frame or tibble
    # ptype: the regex for the precipitation type of interest
    # titleText: title text for the associated plot
    # yAxisText: y-axis text for the associated plot
    # endExclude: vector of end times to be excluded, formatted as yyyy-mm-dd hhmm (all zero-padded)
    # beginExclude: vector of begin times to be excluded, formatted as yyyy-mm-dd hhmm (all zero-padded)
    # endAdd: vector of end times to be added, formatted as yyyy-mm-dd hhmm (all zero-padded)
    # beginAdd: vector of begin times to be added, formatted as yyyy-mm-dd hhmm (all zero-padded)
    # maxProb: length of precipitation (in minutes) beyond which a note is thrown
    # sState: whether the starting state is with precipitation
    # makePlots: whether to create plots
    # returnPlotsAndData: whether to return the plot object with the other data
    # axisMapper: mapping file to create y-axis names from pType
    # titleMapper: mapping file to create the title names from df
    # nColMax: the maximum allowed number of columns for each of precipitation begin and end times
    
    # Create the yAxisText if it has not been provided
    if (is.null(yAxisText)) {
        yAxisText <- paste0("Hours of ", axisMapper[pType])
    }
    
    # Create the title if it has not been provided
    if (is.null(titleText)) {
        titleText <- paste0(titleMapper[frameName], " - Hours of ", axisMapper[pType])
    }
    
    # Extract the precipitation data from a specified processed METAR file
    testFileProc <- extractPrecipData(df, pType=pType)
    
    # Confirm that the two-column specification is met (should relax hard-coding on this)
    testBegin <- getBeginEndTimeMatrix(testFileProc, pState="B")
    testEnd <- getBeginEndTimeMatrix(testFileProc, pState="E")

    if ((ncol(testBegin) > nColMax) | ((ncol(testEnd) > nColMax))) { 
        cat("\nBegin columns:", ncol(testBegin), "\t\tEndcolumns:", ncol(testEnd))
        stop("Hard-coded for at most", nColMax, "columns each of begin/end- Fix")
    }

    # Initialize testAllBegins and testAllEnds
    testAllBegins <- beginAdd
    testAllEnds <- endAdd
    
    # For each iteration, append to testAllBegins and testAllEnds as appropriate
    for (ctr in 1:nColMax) {
        
        # Append begin if exists
        if (ncol(testBegin) >= ctr) {
            bt <- getBeginEndTimeVector(testBegin, testFileProc, extractVar=paste0("V", ctr), extractSym="B")
            testAllBegins <- c(testAllBegins, bt[!is.na(bt)])
        }
        
        # Assign end if exists
        if (ncol(testEnd) >= ctr) {
            et <- getBeginEndTimeVector(testEnd, testFileProc, extractVar=paste0("V", ctr), extractSym="E")
            testAllEnds <- c(testAllEnds, et[!is.na(et)])
        }
        
    }
    
    # Convert to date-time and sort
    testAllBegins <- testAllBegins %>%
        lubridate::ymd_hm() %>%
        sort()

    testAllEnds <- testAllEnds %>%
        lubridate::ymd_hm() %>%
        sort()
    
    # Create the intervals
    testIntervals <- createPrecipInterval(testAllEnds, 
                                          testAllBegins, 
                                          endExclude=endExclude, 
                                          beginExclude=beginExclude, 
                                          maxProb=maxProb, 
                                          sState=sState
                                          )
    
    # Create the precipitation plots
    plotOut <- NULL
    if (makePlots) {
        plotOut <- createPrecipIntervalPlots(testIntervals, 
                                             testAllBegins, 
                                             titleText=titleText, 
                                             yAxisText=yAxisText, 
                                             beginExclude=beginExclude, 
                                             returnPlotsAndData=returnPlotsAndData
                                             )
    }
    
    if (!returnPlotsAndData) { plotOut <- NULL }
    
    # Return all of the key files, along with the parameters used
    keyParams <- list(fileName=frameName, 
                      pType=pType, 
                      endExclude=endExclude, 
                      beginExclude=beginExclude, 
                      endAdd=endAdd, 
                      beginAdd=beginAdd,
                      maxProb=maxProb, 
                      sState=sState
                      )
    list(keyParams=keyParams, 
         testFileProc=testFileProc, 
         testAllBegins=testAllBegins[!(testAllBegins %in% lubridate::ymd_hm(beginExclude))], 
         testAllEnds=testAllEnds[!(testAllEnds %in% lubridate::ymd_hm(endExclude))], 
         testIntervals=testIntervals,
         plotOut=plotOut
         )
}

The functions are then run, using the Chicago 2016 data:

# Run for Chicago 2016, extracting rain that is not freezing rain
kord2016ExceptRA <- suggestBeginEndTimes(kord_2016, regMatch="(?<!FZ)RA")
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 2 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-03-24 21:51:00 KORD 242151Z 34014KT 1 3/4SM -RASN BR SCT010 OVC016 01/00~
## 2 2016-08-29 19:51:00 KORD 291951Z 25009KT 1SM R10L/1200VP6000FT +TSRA BKN041CB~
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-07-13 18:51:00 KORD 131851Z 22009G18KT 10SM FEW027 SCT047 BKN250 28/22 A~
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
kord2016ExceptRA
## # A tibble: 3 x 22
##   dtime               origMETAR curPrecip lagPrecip precipData chgPrecip
##   <dttm>              <chr>     <lgl>     <lgl>     <chr>      <lgl>    
## 1 2016-03-24 21:51:00 KORD 242~ TRUE      FALSE     <NA>       FALSE    
## 2 2016-07-13 18:51:00 KORD 131~ FALSE     TRUE      <NA>       FALSE    
## 3 2016-08-29 19:51:00 KORD 291~ TRUE      FALSE     <NA>       FALSE    
## # ... with 16 more variables: dateUTC <date>, hourUTC <int>, begins <dbl>,
## #   ends <dbl>, b1 <chr>, e1 <chr>, b2 <chr>, e2 <chr>, b3 <lgl>, e3 <lgl>,
## #   etob <lgl>, btoe <lgl>, needBegin <lgl>, needEnd <lgl>, overBegin <lgl>,
## #   overEnd <lgl>
# Run extraction for Chicago 2016
# Attempt to use on the kord2016 rain data
kordRain2016Test <- runFullPrecipExtraction(kord_2016, 
                                            pType="(?<!FZ)RA", 
                                            titleText="Chicago, IL Rainfall (hours) in 2016", 
                                            yAxisText="Hours of Rain", 
                                            endExclude=c(),
                                            beginExclude=c(),
                                            endAdd=c("2016-07-13 1851"), 
                                            beginAdd=c("2016-03-24 2151", "2016-08-29 1951"),
                                            maxProb=1440, 
                                            sState=FALSE, 
                                            makePlots=TRUE
                                            )
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   14.00   33.50   70.51   83.50 1243.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

A function can then be written to test the intervals for consistency:

intervalConsistency <- function(lst, 
                                pType, 
                                maxPrint=NULL
                                ){

    # FUNCTION ARGUMENTS
    # lst: list
    # pType: precipitation type
    # maxPrint: maximum number of mismatches to print (default is to print all)
    
    require(lubridate)
    
    # Extract the beginning and interval times
    begins <- lst[["testAllBegins"]]
    ends <- lst[["testAllEnds"]]
    durs <- lst[["testIntervals"]]

    # Create intervals from the raw list file
    precipInts <- lubridate::interval(begins, begins + durs - 1)

    
    # Extract the METAR and date-time information
    metar <- lst[["testFileProc"]][["origMETAR"]]
    dtime <- lst[["testFileProc"]][["dtime"]]
    
    # Take each METAR observation and check two factors
    # Is the precipitation type recorded in that METAR?
    # Does that METAR fall in any of the intervals?
    precipMETAR <- grepl(paste0("\\d{6}Z.*", pType, ".*RMK"), metar, perl=TRUE)
    intMETAR <- sapply(dtime, FUN=function(x) {x %within% precipInts %>% any()})

    # Check for the consistency of the observations and print the mismatches
    print(table(precipMETAR, intMETAR))

    mism <- which(precipMETAR != intMETAR)
    if (length(mism) == 0) {
        cat("\nFull matches between METAR observations and intervals\n")
    } else {
        if (is.null(maxPrint)) { maxPrint <- length(mism) }
        for (x in mism[1:min(maxPrint, length(mism))]) {
            cat("\nMismatch at time", strftime(dtime[x], format="%Y-%m-%d %H:%M", tz="UTC"), "UTC\n")
            print(metar[max(1, x-2):min(length(metar), x+2)])
        }
    }
    
    list(precipInts=precipInts, mismatches=mism, mismatchTimes=dtime[mism])
}

And, the function can then be run for the Chicago 2016 data:

tmp <- intervalConsistency(kordRain2016Test, pType="(?<!FZ)RA")
## Loading required package: lubridate
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8349    0
##       TRUE      0  466
## 
## Full matches between METAR observations and intervals

So, the process works to extract precipitation data, and for the case of Chicago 2016, creates alignment using only the addition of 1 end time and 2 begin times.

An additional helper function is written to take the exceptions from suggestBeginEndTimes() and to create the vectors for endExclude(), beginExclude(), endAdd(), and beginAdd() automatically:

# Helper function to create times in the format preferred by the other functions
timeFormatter <- function(timeVec) {
    
    if (length(timeVec) == 0) {
        return(character(0))
    }
    
    paste0(as.Date(timeVec), 
           " ", 
           str_pad(lubridate::hour(timeVec), pad="0", side="left", width=2),
           str_pad(lubridate::minute(timeVec), pad="0", side="left", width=2)
           )
}

convertExceptionsToTimes <- function(excFile, maxCol=3) {
    
    # Basic beginAdd (just add the METAR time)
    beginAdd <- excFile %>%
        filter(needBegin) %>%
        pull(dtime) %>%
        timeFormatter()
    
    # Basic endAdd (just add the METAR time)
    endAdd <- excFile %>%
        filter(needEnd) %>%
        pull(dtime) %>%
        timeFormatter()
    
    # Basic endExclude (just delete the earliest end time - maybe add a guard check later)
    endExclude <- excFile %>%
        filter(overEnd) %>%
        select_at(vars(all_of(paste0("e", 1:maxCol)))) %>%
        pull(1)

    # Basic beginExclude (just delete the earliest begin time - maybe add a guard check later)
    beginExclude <- excFile %>%
        filter(overBegin) %>%
        select_at(vars(all_of(paste0("b", 1:maxCol)))) %>%
        pull(1)
    
    list(endAdd=endAdd, beginAdd=beginAdd, endExclude=endExclude, beginExclude=beginExclude)
    
}

And the process can be attempted on the 2016 Chicago snow data:

# Run for Chicago 2016, extracting snow that is not blowing snow
kord2016ExceptSN <- suggestBeginEndTimes(kord_2016, regMatch="(?<!BL)SN")
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 4 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-02-10 00:51:00 KORD 100051Z 31012KT 4SM -SN FEW024 OVC040 M09/M13 A2993 ~
## 2 2016-03-24 21:51:00 KORD 242151Z 34014KT 1 3/4SM -RASN BR SCT010 OVC016 01/00~
## 3 2016-04-08 22:51:00 KORD 082251Z 31014G31KT 9SM -SN SCT045 BKN050 OVC080 02/M~
## 4 2016-12-24 01:51:00 KORD 240151Z 17011KT 4SM -SN BR SCT015 OVC039 01/M01 A299~
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 1 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-04-09 04:51:00 2016-04-09 0446 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
kord2016ExceptSNList <- convertExceptionsToTimes(kord2016ExceptSN)

# Attempt to use on the kord2016 snow data
kordSnow2016Test <- runFullPrecipExtraction(kord_2016, 
                                            pType="(?<!BL)SN", 
                                            titleText="Chicago, IL Snowfall (hours) in 2016", 
                                            yAxisText="Hours of Snow", 
                                            endExclude=kord2016ExceptSNList$endExclude,
                                            beginExclude=kord2016ExceptSNList$beginExclude,
                                            endAdd=kord2016ExceptSNList$endAdd, 
                                            beginAdd=kord2016ExceptSNList$beginAdd,
                                            maxProb=1440, 
                                            sState=FALSE, 
                                            makePlots=TRUE
                                            )
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   20.25   47.50  134.12  140.75 1016.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

tmp <- intervalConsistency(kordSnow2016Test, pType="(?<!BL)SN")
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8532    0
##       TRUE      0  283
## 
## Full matches between METAR observations and intervals

While there is still some work to be done (the current function makes assumptions about the issues with the precipitation data and associated use of e1/b1, as well as not handling cases where there are, for example, 2 begins with no ends), the function shows promise in automating the process of handling precipitation data.

A few edits have been made to the underlying functions:

  • The search string is restricted to precipitation that follows ‘Z’ and preceds ‘RMK’ to avoid the issue of KMSN being trereated as snow
  • suggestBeginEndTimes is updated to allow for a user-defined maximum number of columns for precipitation begin and end times per hour (previously hard-coded to 3)
  • runFullPrecipExtraction is updated to allow for a user-defined maximum number of columns for precipitation begin and end times per hour (previously hard-coded to 3) and to create axis label and title if not passed

The functions can then be combined, allowing for everything to be run at once:

combinedPrecipitation <- function(df, 
                                  pType, 
                                  titleText=NULL,
                                  yAxisText=NULL,
                                  maxProb=1440,
                                  sState=FALSE,
                                  makePlots=TRUE, 
                                  axisMapper=precipMapper,
                                  titleMapper=cityNameMapper, 
                                  frameName=NULL, 
                                  maxConsistencyPrint=NULL
                                  ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame or tibble containing the METAR and dtime
    # pType: the regex code for extracting the relevant precipitation type
    # titleText: Desired plot title (will be created from name of df if NULL)
    # yAxisText: desired y-axis labels (will be created from regex if NULL)
    # maxProb: flag any precipitation events longer than this number of minutes
    # sState: boolean, is there precipitation in the first record?
    # makePlots: boolean, should plots be created?
    # axisMapper: mapping file for creating plot axis name from precipitation type
    # titleMapper: mapping file for creating plot title from data frame name
    # frameName: name of the data frame (kluge due to deparse(substitute(df)) returning "df" if this function is called by another function or in a for loop
    # maxConsistencyPrint: how many mismatches should be printed (default, NULL, prints all)
    
    # Create titleText and yAxisText if they are not already created
    if (is.null(yAxisText)) {
        yAxisText <- paste0("Hours of ", axisMapper[pType])
    }

    # Create the frame name if it has nott been passed
    if (is.null(frameName)) {
        frameName <- deparse(substitute(df))
    }
    
    # Create the title if it has not been provided
    if (is.null(titleText)) {
        titleText <- paste0(titleMapper[frameName], " - Hours of ", axisMapper[pType])
    }    
    
    # Get the suggested edits to the begin and end times, then convert to a usable list
    excSuggest <- suggestBeginEndTimes(df, regMatch=pType)    
    excList <- convertExceptionsToTimes(excSuggest)
    
    # Run the full precipitation extraction
    precipData <- runFullPrecipExtraction(df, 
                                          pType=pType, 
                                          titleText=titleText,
                                          yAxisText=yAxisText,
                                          endExclude=excList$endExclude,
                                          beginExclude=excList$beginExclude,
                                          endAdd=excList$endAdd, 
                                          beginAdd=excList$beginAdd,
                                          maxProb=maxProb, 
                                          sState=sState, 
                                          makePlots=makePlots,
                                          frameName=frameName
                                          )

    # Check for consistency    
    precipIntervals <- intervalConsistency(precipData, pType=pType, maxPrint=maxConsistencyPrint)

    # For the time being, return the most important data (the precipData list)
    precipData
    
}

The function can then be tested on a few files:

# Run for Chicago 2016 for snow, rain, and thunderstorm
kord2016SN <- combinedPrecipitation(kord_2016, pType="(?<!BL)SN")
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 4 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-02-10 00:51:00 KORD 100051Z 31012KT 4SM -SN FEW024 OVC040 M09/M13 A2993 ~
## 2 2016-03-24 21:51:00 KORD 242151Z 34014KT 1 3/4SM -RASN BR SCT010 OVC016 01/00~
## 3 2016-04-08 22:51:00 KORD 082251Z 31014G31KT 9SM -SN SCT045 BKN050 OVC080 02/M~
## 4 2016-12-24 01:51:00 KORD 240151Z 17011KT 4SM -SN BR SCT015 OVC039 01/M01 A299~
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 1 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-04-09 04:51:00 2016-04-09 0446 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   20.25   47.50  134.12  140.75 1016.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8532    0
##       TRUE      0  283
## 
## Full matches between METAR observations and intervals
kord2016RA <- combinedPrecipitation(kord_2016, pType="(?<!FZ)RA")
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 2 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-03-24 21:51:00 KORD 242151Z 34014KT 1 3/4SM -RASN BR SCT010 OVC016 01/00~
## 2 2016-08-29 19:51:00 KORD 291951Z 25009KT 1SM R10L/1200VP6000FT +TSRA BKN041CB~
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-07-13 18:51:00 KORD 131851Z 22009G18KT 10SM FEW027 SCT047 BKN250 28/22 A~
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   14.00   33.50   70.51   83.50 1243.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8349    0
##       TRUE      0  466
## 
## Full matches between METAR observations and intervals
kord2016TS <- combinedPrecipitation(kord_2016, pType="(?<!VC)TS")
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-10-12 17:51:00 KORD 121751Z 23008G19KT 3SM R10L/5000VP6000FT -TSRA BR FE~
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 2 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-08-28 08:51:00 2016-08-28 0825 <NA>  NA   
## 2 2016-09-21 17:51:00 2016-09-21 1744 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 3 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-08-28 09:51:00 2016-08-28 0910 <NA>  NA   
## 2 2016-09-21 18:51:00 2016-09-21 1759 <NA>  NA   
## 3 2016-10-06 21:51:00 2016-10-06 2131 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -3.00   23.50   43.50   62.66   93.25  266.00 
## 
## Problem Detected - Intervals are not positive.  Data to help investigate
## 
## Vector of Begins
##  [1] "2016-09-28 23:19:00 UTC" "2016-10-01 16:02:00 UTC"
##  [3] "2016-10-06 10:29:00 UTC" "2016-10-06 23:56:00 UTC"
##  [5] "2016-10-12 12:16:00 UTC" "2016-10-12 17:36:00 UTC"
##  [7] "2016-10-12 17:51:00 UTC" "2016-10-16 09:57:00 UTC"
##  [9] "2016-10-26 14:48:00 UTC" "2016-10-30 07:56:00 UTC"
## [11] "2016-11-02 21:06:00 UTC"
## 
## Vector of Ends
##  [1] "2016-09-29 00:12:00 UTC" "2016-10-01 16:21:00 UTC"
##  [3] "2016-10-06 11:04:00 UTC" "2016-10-07 00:25:00 UTC"
##  [5] "2016-10-12 12:41:00 UTC" "2016-10-12 17:33:00 UTC"
##  [7] "2016-10-12 18:05:00 UTC" "2016-10-16 11:20:00 UTC"
##  [9] "2016-10-26 15:28:00 UTC" "2016-10-30 09:37:00 UTC"
## [11] "2016-11-02 21:21:00 UTC"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8749    0
##       TRUE      0   66
## 
## Full matches between METAR observations and intervals

The thunderstorm data contains an interesting consistency error:

kord_2016 %>% 
    filter(monthint==10, day==12, lubridate::hour(dtime) %in% c(14, 15, 16, 17, 18, 19)) %>%
    pull(origMETAR)
## [1] "KORD 121451Z 21013G21KT 10SM FEW050 SCT110 BKN140 OVC250 18/14 A2993 RMK AO2 SLP133 CB DSNT SW TCU DSNT S 60000 T01830144 56005"                                                                                     
## [2] "KORD 121551Z 23011G19KT 10SM FEW021 SCT050CB SCT120 BKN160 BKN250 20/15 A2992 RMK AO2 LTG DSNT SW SLP130 CB S MOV NE CB DSNT SW T02000150"                                                                           
## [3] "KORD 121651Z 24012G18KT 8SM SCT015 OVC020 18/16 A2995 RMK AO2 LTG DSNT SE RAB1554E33 SLP139 P0001 T01780156"                                                                                                         
## [4] "KORD 121751Z 23008G19KT 3SM R10L/5000VP6000FT -TSRA BR FEW014 BKN060CB OVC095 17/16 A2993 RMK AO2 LTG DSNT NE AND SE AND SW RAB1652 TSE33B36 SLP133 OCNL LTGCG E TS E MOV NE P0006 60007 T01720161 10206 20161 50003"
## [5] "KORD 121851Z 23008KT 6SM -RA BR BKN012 BKN095 OVC120 18/16 A2990 RMK AO2 TSE05 SLP123 P0002 T01780161"                                                                                                               
## [6] "KORD 121951Z 26013KT 5SM -RA BR BKN015 BKN019 OVC095 18/16 A2990 RMK AO2 RAE1857B20 SLP124 P0000 T01780156"
  • The 1651Z METAR has no thunderstorm information
  • The 1751Z METAR has thunderstorm information including -TSRA and TSE33B36
  • The 1851Z METAR has no active thunderstorm and TSE05

The algorithm as currently set up is creating a -3 minute interval (from 1736 to 1733) and then also creating a start time of 1751 to match the active -TSRA and following TSE05.

In practice, it is more likely that the TSE33 should be deleted, leaving a short interval of TSB36 (1736) through TSE05 (1805) that overlaps the -TSRA observation at 1751.

Next steps are to work through this further consistency check and test the functions on other locales.

An updated function suggestBeginEndTimes() is created to better manage the situation above. Broadly speaking, there are three types of issues with precipitation data (assuming focus on a single precipitation type, such as rain):

  • Current METAR precipitation state (yes/no) is the same as previous state, but there is a non-zero net of begin and end in the remarks
  • Current METAR has precipitation, previous METAR does not, and there is not a net +1 of begin/end
  • Current METAR has no precipitation, previous METAR does, and there is not a net -1 of begin/end

There are two ancillary precipitation issues that can be observed:

  • Net changes can be OK, but in a manner that implies negative or overlapping intervals - for example, -RA at 1651 and -RA at 1751 with remarks that show RAB18E21 (net change is zero which is OK, but rain already happening cannot begin and if the rain had actually ended at 21 then it would not be occurring at 51)
  • Net changes can be of magnitude greater than one - for example, no rain at 1151 and rain at 1251 and remarks that show RAB11E26B45B50 (net change in rain of +2 where state change makes clear it should be +1)

The primary issues will be addressed first in an updated function:

# Helper function to print out the potential issue
printBeginEndIssues <- function(df, nColMax=3) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame or tibble containing the issues
    # nColMax: the maximum number of columns allowed for precipitation begins and precipitation ends
    # Variable names in the file are assumed to have a specific format as hard-coded in the function
    
    cat("\nNeed Begin time\n")
    df %>%
        filter(needBegin) %>%
        select(dtime, origMETAR) %>%
        print()
    
    cat("\nNeed End time\n")
    df %>%
        filter(needEnd) %>%
        select(dtime, origMETAR) %>%
        print()
    
    cat("\nExtraneous Begin time\n")
    df %>%
        filter(overBegin) %>%
        select(dtime, paste0("b", 1:nColMax)) %>%
        print()
    
    cat("\nExtraneous End time\n")
    df %>%
        filter(overEnd) %>%
        select(dtime, paste0("e", 1:nColMax)) %>%
        print()
    
    cat("\nWrong amount of begins or ends\n")
    df %>%
        filter(tooManyBE) %>%
        select(dtime, paste0("e", 1:nColMax), paste0("b", 1:nColMax)) %>%
        print()
    
}


# Update suggestBeginEndTimes
suggestBeginEndTimes <- function(df, 
                                 regMatch, 
                                 nColMax=3, 
                                 printNACols=FALSE, 
                                 printIssues=TRUE
                                 ) {
    
    # FUNCTION ARGUMENTS
    # df: the data frame to be examined
    # regMatch: the relevant regex extraction code
    # nColMax: the maximum number of columns allowed for precipitation begins and precipitation ends
    # printNACols: boolean, whether to print the NA column sums
    # printIssues: boolean, whether to print out all of the issues
    
    # Pull the data and check for the specified precipitation pattern and lags
    # Needs to follow the \\d{6}Z that shows the time and precede the RMK that denotes the remarks
    sugStates <- df %>%
        select(dtime, origMETAR) %>%
        mutate(curPrecip=str_detect(origMETAR, paste0("\\d{6}Z.*", regMatch, ".*RMK")), 
               lagPrecip=lag(curPrecip, 1)
        )
    
    # Use the analysis data to look for begins and ends flagged in the remarks
    sugBE <- extractPrecipData(sugStates, pType=regMatch)
    
    # Inner join the data by dtime
    sugBEJoin <- sugBE %>% 
        select(dtime, precipData, chgPrecip=isPrecip, dateUTC, hourUTC)
    sugStates <- sugStates %>%
        inner_join(sugBEJoin, by="dtime")
    
    # Get the beginning and end times data for the desired precipitation type
    sugBegin <- getBeginEndTimeMatrix(sugStates, pState="B")
    sugEnd <- getBeginEndTimeMatrix(sugStates, pState="E")
    
    # Ensure that user-defined maximum number of columns is not exceeded
    if (ncol(sugBegin) > nColMax | ncol(sugEnd) > nColMax) { 
        stop("Function argument set to allow only for 0 - ", nColMax, "columns, investigate and fix")
    }
    
    # For each iteration, create testBTn and testETn (use NA if no column)
    for (ctr in 1:nColMax) {
        
        # Create NA as the baseline
        assign(paste0("testBT", ctr), NA)
        assign(paste0("testET", ctr), NA)
        
        # Assign begin if exists
        if (ncol(sugBegin) >= ctr) {
            bt <- getBeginEndTimeVector(sugBegin, sugStates, extractVar=paste0("V", ctr), extractSym="B")
            assign(paste0("testBT", ctr), bt)
        }
        
        # Assign end if exists
        if (ncol(sugEnd) >= ctr) {
            et <- getBeginEndTimeVector(sugEnd, sugStates, extractVar=paste0("V", ctr), extractSym="E")
            assign(paste0("testET", ctr), et)
        }
        
    }
    
    # Integrate to a single file, allowing for nMaxCol applications of variables
    # Create file sugExceptions as sugStates and initialize the columns for begins and ends
    sugExceptions <- sugStates %>%
        mutate(begins=0, ends=0)
    
    # Loop through up to nColMax, adding the appropriate bn=testBTn and en=testETn
    for (ctr in 1:nColMax) {
        sugExceptions <- sugExceptions %>%
            mutate(!!paste0("b", ctr) := get(paste0("testBT", ctr)), 
                   !!paste0("e", ctr) := get(paste0("testET", ctr)), 
                   begins=begins + ifelse(is.na(get(paste0("b", ctr))), 0, 1),
                   ends=ends + ifelse(is.na(get(paste0("e", ctr))), 0, 1)
            )
    }
    
    # Create the final version by checking key criteria
    # Add a criteria for b before e if previous; and e before b if not previous?
    # Potential issues include:
    # Current and lagged state are the same when begins != ends
    # Current is no, lagged is yes, ends <= begins
    # Current is yes, lagged is no, begins <= ends
    sugExceptions <- sugExceptions %>%
        mutate(issueA=(lagPrecip==curPrecip & begins!=ends), # no state change should have begins==ends
               issueB=(lagPrecip & !curPrecip & ends <= begins),  # change to 'no' should have ends>bsgins
               issueC=(curPrecip & !lagPrecip & begins <= ends),  # change to 'yse' should have begins>ends
               issueD=abs(begins-ends) > 1,  # state is a 1/0 so net change should be -1, 0, or 1
               needBegin=issueC & begins==0,
               needEnd=issueB & ends==0,
               overBegin=(issueA & (begins > ends)) | (issueB & (ends > 0)),
               overEnd=(issueA & (ends > begins)) | (issueC & (begins > 0)),
               tooManyBE=issueD
        )
    
    # Report on the NA status by column
    if (printNACols) {
        colSums(is.na(sugExceptions)) %>% 
            print()
    }

    # Print out the issue if requested
    if (printIssues) {
        printBeginEndIssues(sugExceptions, nColMax=nColMax)
    }
    
    sugExceptions %>%
        filter(issueA | issueB | issueC | issueD)
    
}

The function can then be run on the Chicago 2016 snow data and compared to previous:

# Run for the Chicago snow data
tmp <- suggestBeginEndTimes(kord_2016, regMatch="(?<!BL)SN")
## 
## Regex search code is: ((?<!BL)SN[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 4 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-02-10 00:51:00 KORD 100051Z 31012KT 4SM -SN FEW024 OVC040 M09/M13 A2993 ~
## 2 2016-03-24 21:51:00 KORD 242151Z 34014KT 1 3/4SM -RASN BR SCT010 OVC016 01/00~
## 3 2016-04-08 22:51:00 KORD 082251Z 31014G31KT 9SM -SN SCT045 BKN050 OVC080 02/M~
## 4 2016-12-24 01:51:00 KORD 240151Z 17011KT 4SM -SN BR SCT015 OVC039 01/M01 A299~
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 1 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-04-09 04:51:00 2016-04-09 0446 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
# Compare to old Chicago snow data
namesAll <- intersect(names(kord2016ExceptSN), names(tmp))
cat("\nVariables deleted:", setdiff(names(kord2016ExceptSN), namesAll), "\n")
## 
## Variables deleted: etob btoe
cat("\nVariables created:", setdiff(names(tmp), namesAll), "\n")
## 
## Variables created: issueA issueB issueC issueD tooManyBE
identical(kord2016ExceptSN[, namesAll], tmp[, namesAll])
## [1] TRUE

So, for the Chicago 2016 snow data, the new function is pulling the same information, just with somewhat different variables (keeping an issue log as variables may be helpful later).

The data are also re-run on Chicago thunderstorms to see if the negative interval is correctly pulled out:

# Run for the Chicago thunderstorm data
tmp <- suggestBeginEndTimes(kord_2016, regMatch="(?<!VC)TS") 
## 
## Regex search code is: ((?<!VC)TS[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 2 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-08-28 08:51:00 2016-08-28 0825 <NA>  NA   
## 2 2016-09-21 17:51:00 2016-09-21 1744 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 4 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-08-28 09:51:00 2016-08-28 0910 <NA>  NA   
## 2 2016-09-21 18:51:00 2016-09-21 1759 <NA>  NA   
## 3 2016-10-06 21:51:00 2016-10-06 2131 <NA>  NA   
## 4 2016-10-12 17:51:00 2016-10-12 1733 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
# Show the data
tmp %>%
    select(dtime, curPrecip, lagPrecip, precipData, needBegin, needEnd, overBegin, overEnd, tooManyBE)
## # A tibble: 6 x 9
##   dtime               curPrecip lagPrecip precipData needBegin needEnd overBegin
##   <dttm>              <lgl>     <lgl>     <chr>      <lgl>     <lgl>   <lgl>    
## 1 2016-08-28 08:51:00 FALSE     FALSE     TSB25      FALSE     FALSE   TRUE     
## 2 2016-08-28 09:51:00 FALSE     FALSE     TSE10      FALSE     FALSE   FALSE    
## 3 2016-09-21 17:51:00 FALSE     FALSE     TSB44      FALSE     FALSE   TRUE     
## 4 2016-09-21 18:51:00 FALSE     FALSE     TSE1759    FALSE     FALSE   FALSE    
## 5 2016-10-06 21:51:00 FALSE     FALSE     TSE31      FALSE     FALSE   FALSE    
## 6 2016-10-12 17:51:00 TRUE      FALSE     TSE33B36   FALSE     FALSE   FALSE    
## # ... with 2 more variables: overEnd <lgl>, tooManyBE <lgl>
# Run through the conversion to begin and end times
tmp %>%
    convertExceptionsToTimes()
## $endAdd
## character(0)
## 
## $beginAdd
## character(0)
## 
## $endExclude
## [1] "2016-08-28 0910" "2016-09-21 1759" "2016-10-06 2131" "2016-10-12 1733"
## 
## $beginExclude
## [1] "2016-08-28 0825" "2016-09-21 1744"

It appears that this change will correct for the issue on 2016-10-12 with TSE33B36.

The process can then be run for the rain data in all of the files, with outputs saved to a single master list (initial exploration showed some issues with Minneapolis and New Orleans, so these are excluded to start):

# Create the list of files to run on - everything in 2016
fileNames <- grep(pattern="2016", names(cityNameMapper), value=TRUE)
fileNames <- fileNames[!(fileNames %in% c("kmsp_2016", "kmsy_2016"))]
print(fileNames)
##  [1] "kdtw_2016" "kewr_2016" "kgrb_2016" "kgrr_2016" "kiah_2016" "kind_2016"
##  [7] "klas_2016" "klnk_2016" "kmke_2016" "kmsn_2016" "kord_2016" "ksan_2016"
## [13] "ktvc_2016"
# Create a main storage list
listRA <- vector("list", length(fileNames))
names(listRA) <- fileNames

for (fName in fileNames) {

    listRA[[fName]] <- combinedPrecipitation(get(fName), pType="(?<!FZ)RA", frameName=fName)

}
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 2 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-02-24 11:53:00 KDTW 241153Z 04014G19KT 5SM -RAPL SCT008 OVC012 01/M02 A2~
## 2 2016-04-07 23:53:00 KDTW 072353Z 30007KT 7SM -SNRA BKN022 OVC041 01/M01 A2962~
## 
## Need End time
## # A tibble: 2 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-04-07 22:53:00 KDTW 072253Z 30013G19KT 1 1/2SM R03R/6000VP6000FT -SN BKN~
## 2 2016-08-16 14:53:00 KDTW 161453Z 22019G22 10SM BKN014 OVC030 24/22 A2993 RMK ~
## 
## Extraneous Begin time
## # A tibble: 2 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-04-01 00:53:00 2016-04-01 0001 <NA>  NA   
## 2 2016-08-15 00:53:00 2016-08-14 2355 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 1 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-07-14 00:53:00 2016-07-14 0030 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   15.00   35.00   86.58   89.50  932.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8258    0
##       TRUE      0  560
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 1 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-02-24 19:51:00 2016-02-24 1858 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 1 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-11-20 08:51:00 2016-11-20 0758 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0    17.0    41.0    97.3   104.0   858.0
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8195    0
##       TRUE      0  626
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 1 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-05-02 17:53:00 2016-05-02 1739 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   12.00   28.00   61.16   63.00  857.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8378    0
##       TRUE      0  425
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-06-26 14:53:00 KGRR 261453Z 25005KT 1 1/2SM RA BR SCT049 BKN090 OVC110 2~
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-07-24 15:53:00 KGRR 241553Z 14004KT 10SM TS FEW048CB BKN080 OVC110 22/20~
## 
## Extraneous Begin time
## # A tibble: 2 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-06-26 13:53:00 2016-06-26 1346 <NA>  NA   
## 2 2016-08-12 09:53:00 2016-08-12 0854 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 1 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-04-07 12:53:00 2016-04-07 1215 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   17.00   40.50   94.02  111.75  901.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8208    0
##       TRUE      0  604
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 2 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-08-14 00:53:00 KIAH 140053Z 00000KT 4SM TSRA BR SCT007 SCT020 BKN045CB O~
## 2 2016-08-14 22:53:00 KIAH 142253Z 09008KT 1/2SM R27/1600V2400FT -TSRA FEW006 S~
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                          
##   <dttm>              <chr>                                              
## 1 2016-12-04 10:53:00 KIAH 041053Z 32007KT 8SM OVC005 11/10 A2993 RMK AO2
## 
## Extraneous Begin time
## # A tibble: 1 x 4
##   dtime               b1              b2              b3   
##   <dttm>              <chr>           <chr>           <lgl>
## 1 2016-08-21 19:53:00 2016-08-21 1857 2016-08-21 1945 NA   
## 
## Extraneous End time
## # A tibble: 2 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-06-06 00:53:00 2016-06-06 0040 <NA>  NA   
## 2 2016-12-03 18:53:00 2016-12-03 1812 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -18.00   13.00   28.00   59.71   65.00  845.00 
## 
## Problem Detected - Intervals are not positive.  Data to help investigate
## 
## Vector of Begins
##  [1] "2016-08-19 18:36:00 UTC" "2016-08-20 18:04:00 UTC"
##  [3] "2016-08-20 19:15:00 UTC" "2016-08-21 07:52:00 UTC"
##  [5] "2016-08-21 14:55:00 UTC" "2016-08-21 19:45:00 UTC"
##  [7] "2016-08-22 16:26:00 UTC" "2016-08-23 20:57:00 UTC"
##  [9] "2016-08-24 20:02:00 UTC" "2016-08-25 18:12:00 UTC"
## [11] "2016-08-26 11:41:00 UTC"
## 
## Vector of Ends
##  [1] "2016-08-19 19:24:00 UTC" "2016-08-20 18:49:00 UTC"
##  [3] "2016-08-20 19:49:00 UTC" "2016-08-21 08:10:00 UTC"
##  [5] "2016-08-21 15:21:00 UTC" "2016-08-21 19:27:00 UTC"
##  [7] "2016-08-22 20:26:00 UTC" "2016-08-23 21:09:00 UTC"
##  [9] "2016-08-24 20:41:00 UTC" "2016-08-25 20:23:00 UTC"
## [11] "2016-08-26 12:26:00 UTC"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8367    1
##       TRUE      0  448
## 
## Mismatch at time 2016-03-23 21:53 UTC
## [1] "KIAH 231953Z 17012G18KT 10SM BKN032 OVC050 23/17 A2986 RMK AO2 SLP110 T02280167"                                     
## [2] "KIAH 232053Z 16017G24KT 10SM SCT030 OVC048 22/18 A2984 RMK AO2 RAB02E12 SLP104 SCT V BKN P0000 60000 T02220178 56021"
## [3] "KIAH 232153Z 17011KT 10SM SCT029 OVC050 22/18 A2983 RMK AO2 RAB2058E13 SLP102 P0000 T02170178"                       
## [4] "KIAH 232253Z 16010KT 10SM SCT030 OVC055 21/18 A2982 RMK AO2 SLP098 RAB2145E16 P0000 T02060183"                       
## [5] "KIAH 232353Z 16014G19KT 10SM BKN018 OVC050 20/18 A2982 RMK AO2 SLP096 60000 T02000183 10244 20200 56008"             
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-07-18 17:54:00 KIND 181754Z 24013KT 10SM -TSRA BKN030CB OVC040 24/21 A30~
## 
## Need End time
## # A tibble: 3 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-04-07 23:54:00 KIND 072354Z 30011KT 10SM BKN037 BKN060 04/M01 A2980 RMK ~
## 2 2016-05-08 19:54:00 KIND 081954Z 19007KT 10SM FEW010 OVC080 14/09 A2995 RMK A~
## 3 2016-08-07 19:54:00 KIND 071954Z 29007KT 10SM SCT060 SCT250 31/16 A2999 RMK A~
## 
## Extraneous Begin time
## # A tibble: 1 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-09-05 21:54:00 2016-09-05 2101 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   16.00   39.50   85.24   97.75  886.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8144    0
##       TRUE      0  623
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-04-30 23:56:00 KLAS 302356Z 04026G31KT 6SM TSRA SCT035 BKN080CB OVC100 1~
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     6.0    16.0    29.0    52.1    57.5   502.0
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8710    0
##       TRUE      0  108
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 2 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-08-29 22:54:00 KLNK 292254Z 35010KT 8SM -TSRA FEW070 BKN100 OVC120 23/20~
## 2 2016-09-16 06:54:00 KLNK 160654Z 36015G25KT 6SM RA BR FEW070 OVC110 16/15 A30~
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-07-13 18:54:00 KLNK 131854Z 26005KT 10SM CLR 32/16 A2988 RMK AO2 SLP106 ~
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   15.00   34.00   67.63   82.00  522.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8500    0
##       TRUE      0  313
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 4 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-04-10 10:52:00 KMKE 101052Z 16017G26KT 8SM -PLRASN FEW046 BKN065 OVC075 ~
## 2 2016-06-10 03:52:00 KMKE 100352Z 20005KT 10SM -TSRA SCT070CB BKN090 18/16 A29~
## 3 2016-08-20 13:52:00 KMKE 201352Z 17010KT 10SM -RA FEW016 FEW035 BKN090 BKN150~
## 4 2016-10-07 02:52:00 KMKE 070252Z 18007KT 3SM VCTS -RA BR FEW030 BKN070 BKN095~
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0    14.0    32.0    71.8    73.0  1335.0
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8346    0
##       TRUE      0  462
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-09-16 11:53:00 KMSN 161153Z 16011KT 10SM FEW090 SCT110 OVC150 18/16 A299~
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 1 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-09-07 12:53:00 2016-09-07 1252 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   13.00   29.00   70.11   74.75  990.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8347    0
##       TRUE      0  451
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 2 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-03-24 21:51:00 KORD 242151Z 34014KT 1 3/4SM -RASN BR SCT010 OVC016 01/00~
## 2 2016-08-29 19:51:00 KORD 291951Z 25009KT 1SM R10L/1200VP6000FT +TSRA BKN041CB~
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-07-13 18:51:00 KORD 131851Z 22009G18KT 10SM FEW027 SCT047 BKN250 28/22 A~
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   14.00   33.50   70.51   83.50 1243.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8349    0
##       TRUE      0  466
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Need End time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Extraneous Begin time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, b1 <chr>, b2 <chr>, b3 <lgl>
## 
## Extraneous End time
## # A tibble: 0 x 4
## # ... with 4 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   12.00   21.00   55.53   51.75  638.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8612    0
##       TRUE      0  198
## 
## Full matches between METAR observations and intervals
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-09-28 19:53:00 KTVC 281953Z 36010KT 10SM FEW027 FEW100 SCT250 19/11 A300~
## 
## Extraneous Begin time
## # A tibble: 1 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-06-06 21:53:00 2016-06-06 2117 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 3 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-03-28 10:53:00 2016-03-28 0957 <NA>  NA   
## 2 2016-06-06 20:53:00 2016-06-06 2049 <NA>  NA   
## 3 2016-12-02 16:53:00 2016-12-02 1629 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   13.50   30.00   66.53   68.00  642.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8345    0
##       TRUE      0  469
## 
## Full matches between METAR observations and intervals

Initial outputs are encouraging:

  • In all locales, the rain intervals created perfectly align with the hourly METAR rain data (with the exception of Houston, TX on 2016-03-23 21:53)
  • In all locales, the minimum rain interval length is positive (with the exception of Houston, TX 2016-08-21 1945/1927)
  • In all locales, the maximum rain interval length is under 24 hours

Further diagnostics of the Houston situation include:

# August 21, 2016 issue
kiah_2016 %>% 
    filter(lubridate::date(dtime)=="2016-08-21", lubridate::hour(dtime) %in% c(18, 19, 20, 21)) %>%
    pull(origMETAR)
## [1] "KIAH 211853Z 08005KT 10SM FEW018 FEW033 SCT090 BKN150 BKN180 BKN250 29/22 A3001 RMK AO2 SLP160 CB DSNT SW T02890222"         
## [2] "KIAH 211953Z 13009KT 10SM SCT029 BKN049 BKN100 BKN150 BKN250 27/24 A2999 RMK AO2 RAB1857E27B45RAEMM SLP152 P0000 T02720244 $"
## [3] "KIAH 212053Z VRB05KT 10SM FEW017 SCT029 SCT050 BKN160 BKN250 30/24 A2998 RMK AO2 SLP150 60000 T03000244 58012"               
## [4] "KIAH 212153Z 12009KT 10SM BKN023 BKN160 BKN250 29/24 A2997 RMK AO2 SLP148 BKN023 V SCT T02940239"
# March 23, 2016 issue
kiah_2016 %>% 
    filter(lubridate::date(dtime)=="2016-03-23", lubridate::hour(dtime) %in% c(20, 21, 22, 23)) %>%
    pull(origMETAR)
## [1] "KIAH 232053Z 16017G24KT 10SM SCT030 OVC048 22/18 A2984 RMK AO2 RAB02E12 SLP104 SCT V BKN P0000 60000 T02220178 56021"
## [2] "KIAH 232153Z 17011KT 10SM SCT029 OVC050 22/18 A2983 RMK AO2 RAB2058E13 SLP102 P0000 T02170178"                       
## [3] "KIAH 232253Z 16010KT 10SM SCT030 OVC055 21/18 A2982 RMK AO2 SLP098 RAB2145E16 P0000 T02060183"                       
## [4] "KIAH 232353Z 16014G19KT 10SM BKN018 OVC050 20/18 A2982 RMK AO2 SLP096 60000 T02000183 10244 20200 56008"

The August 21 situation highlights an area where the guard logic needs improvement:

  • 1953Z … RAB1857E27B45EMM requires either 1) replace RAEMM with RAE53 (treat as needEnd), or 2) remove the B45 that is paired to the EMM (treat as overBegin)
  • The current guard logic will see this as 2 begins and 1 end and treat as an overBegin (OK so far); since there is an end time, the guard logic just delete the first begin time to make them equal
  • This is a guard logic error of potentially very minor consequence in that it produces under 60 minutes of problem across 15 cities over the course of a year

The March 23 situation highlights an error in the METAR:

  • 2253Z … RAB2145E16 creates rain at a time period that belongs to the 2153Z METAR (likely a transpose typo since RAB2154E16 would be appropriate in the 2253Z METAR)
  • The guard logic should be updated to only allow four-digit times that ‘belong’ to the current METAR, flagging a warning and converting them to be in range automatically

The New Orleans data are run stand-alone since there were a few more issues:

tmpMSY <- combinedPrecipitation(kmsy_2016, pType="(?<!FZ)RA", frameName="kmsy_2016")
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 0 x 2
## # ... with 2 variables: dtime <dttm>, origMETAR <chr>
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-03-27 16:53:00 KMSY 271653Z 12010KT 10SM BKN008 BKN030CB BKN060 OVC130 2~
## 
## Extraneous Begin time
## # A tibble: 2 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-03-11 20:53:00 2016-03-11 2007 <NA>  NA   
## 2 2016-08-14 23:53:00 2016-08-14 2339 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 2 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-12-13 23:53:00 2016-12-13 2258 <NA>  NA   
## 2 2017-01-01 09:53:00 2017-01-01 0852 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*)
## Warning in unclass(time1) - unclass(time2): longer object length is not a
## multiple of shorter object length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -528234      15      30   -1142      66     738 
## 
## Problem Detected - Intervals are not positive.  Data to help investigate
## 
## Vector of Begins
## [1] "2017-01-01 13:06:00 UTC" "2017-01-01 17:04:00 UTC"
## [3] "2017-01-01 18:41:00 UTC" "2017-01-01 19:51:00 UTC"
## [5] "2017-01-01 20:15:00 UTC"
## 
## Vector of Ends
## [1] "2017-01-01 13:48:00 UTC" "2017-01-01 17:23:00 UTC"
## [3] "2017-01-01 18:54:00 UTC" "2017-01-01 19:55:00 UTC"
## [5] "2017-01-01 20:49:00 UTC"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8375    0
##       TRUE      3  435
## 
## Mismatch at time 2017-01-01 21:53 UTC
## [1] "KMSY 011953Z 20015G21KT 10SM +RA FEW022 SCT046 BKN065 OVC130 24/22 A2991 RMK AO2 LTG DSNT N RAE1854B51 SLP132 P0000 T02440222 $"                       
## [2] "KMSY 012053Z 28007KT 9SM OVC024 19/19 A2993 RMK AO2 WSHFT 1957 LTG DSNT NE AND W RAE1955B15E49 SLP140 CB DSNT N P0007 60013 T01940189 55004 $"         
## [3] "KMSY 012153Z 29007KT 5SM -TSRA BR BKN028CB OVC035 19/19 A2995 RMK AO2 LTG DSNT W RAB2055 TSB31 SLP145 OCNL LTGICCG W-N TS W-N MOV NE P0004 T01890189 $"
## [4] "KMSY 012253Z 29007KT 5SM RA BR BKN032CB BKN041 OVC050 18/18 A2995 RMK AO2 LTG DSNT SE-W TSE06 SLP146 CB ALQDS MOV NE P0019 T01830183 $"                
## [5] "KMSY 012353Z 22003KT 3SM RA BR FEW034 SCT041 OVC055 18/18 A2998 RMK AO2 LTG DSNT E SLP155 CB ALQDS MOV NE P0028 60064 T01780178 10244 20178 53016 $"   
## 
## Mismatch at time 2017-01-01 22:53 UTC
## [1] "KMSY 012053Z 28007KT 9SM OVC024 19/19 A2993 RMK AO2 WSHFT 1957 LTG DSNT NE AND W RAE1955B15E49 SLP140 CB DSNT N P0007 60013 T01940189 55004 $"         
## [2] "KMSY 012153Z 29007KT 5SM -TSRA BR BKN028CB OVC035 19/19 A2995 RMK AO2 LTG DSNT W RAB2055 TSB31 SLP145 OCNL LTGICCG W-N TS W-N MOV NE P0004 T01890189 $"
## [3] "KMSY 012253Z 29007KT 5SM RA BR BKN032CB BKN041 OVC050 18/18 A2995 RMK AO2 LTG DSNT SE-W TSE06 SLP146 CB ALQDS MOV NE P0019 T01830183 $"                
## [4] "KMSY 012353Z 22003KT 3SM RA BR FEW034 SCT041 OVC055 18/18 A2998 RMK AO2 LTG DSNT E SLP155 CB ALQDS MOV NE P0028 60064 T01780178 10244 20178 53016 $"   
## 
## Mismatch at time 2017-01-01 23:53 UTC
## [1] "KMSY 012153Z 29007KT 5SM -TSRA BR BKN028CB OVC035 19/19 A2995 RMK AO2 LTG DSNT W RAB2055 TSB31 SLP145 OCNL LTGICCG W-N TS W-N MOV NE P0004 T01890189 $"
## [2] "KMSY 012253Z 29007KT 5SM RA BR BKN032CB BKN041 OVC050 18/18 A2995 RMK AO2 LTG DSNT SE-W TSE06 SLP146 CB ALQDS MOV NE P0019 T01830183 $"                
## [3] "KMSY 012353Z 22003KT 3SM RA BR FEW034 SCT041 OVC055 18/18 A2998 RMK AO2 LTG DSNT E SLP155 CB ALQDS MOV NE P0028 60064 T01780178 10244 20178 53016 $"

New Orleans is interesting in that the final hours of the final day of the pulled METAR are all raining. This leads (correctly) to no final RAE in the METAR, and the logic should be updated to place an artificial RAE in the last minute of the file. Without that, there is recycling, and the interval becomes -365 days. There is a similar need to add logic in case the beginning state is RAB.

The Minneapolis data are run stand-alone since there are many issues:

tmpMSP <- combinedPrecipitation(kmsp_2016, pType="(?<!FZ)RA", frameName="kmsp_2016", maxConsistencyPrint = 5)
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## Need Begin time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-09-06 07:53:00 KMSP 060753Z 22007G17KT 6SM -TSRA FEW023 BKN049CB OVC095 ~
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-11-22 13:53:00 KMSP 221353Z 13018KT 10SM -PL OVC055 01/M06 A3015 RMK AO2~
## 
## Extraneous Begin time
## # A tibble: 4 x 4
##   dtime               b1              b2              b3   
##   <dttm>              <chr>           <chr>           <lgl>
## 1 2016-05-29 19:53:00 2016-05-29 1931 <NA>            NA   
## 2 2016-06-20 04:53:00 2016-06-20 0411 <NA>            NA   
## 3 2016-09-06 08:53:00 2016-09-06 0744 2016-09-06 0817 NA   
## 4 2016-11-29 07:53:00 2016-11-29 0718 <NA>            NA   
## 
## Extraneous End time
## # A tibble: 2 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-04-05 15:53:00 2016-04-05 1522 <NA>  NA   
## 2 2016-10-18 04:53:00 2016-10-18 0435 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*)
## Warning in unclass(time1) - unclass(time2): longer object length is not a
## multiple of shorter object length
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -476532.0      18.0      67.5    -818.0     188.0   28458.0 
## 
## Problem Detected - Intervals are not positive.  Data to help investigate
## 
## Vector of Begins
## [1] "2016-12-05 18:06:00 UTC" "2016-12-05 18:50:00 UTC"
## [3] "2016-12-06 00:35:00 UTC" "2016-12-23 17:02:00 UTC"
## [5] "2016-12-25 17:59:00 UTC"
## 
## Vector of Ends
## [1] "2016-12-05 18:59:00 UTC" "2016-12-06 00:50:00 UTC"
## [3] "2016-12-23 17:45:00 UTC" "2016-12-25 22:35:00 UTC"
## [5] "2016-12-26 05:07:00 UTC"
## 
## 
## Potential problem Detected - very long.  Data to help investigate
## 
## Positions with problems are: 260 266 268 269 270 271 278 280 288 289 294 298 303 304 312 314 319 321 327 328 350 353 354
## Vector of Begins
##  [1] "2016-09-07 01:55:00 UTC" "2016-09-07 02:22:00 UTC"
##  [3] "2016-09-07 12:44:00 UTC" "2016-09-07 14:12:00 UTC"
##  [5] "2016-09-07 15:08:00 UTC" "2016-09-07 17:02:00 UTC"
##  [7] "2016-09-09 00:52:00 UTC" "2016-09-09 05:36:00 UTC"
##  [9] "2016-09-09 07:51:00 UTC" "2016-09-09 11:49:00 UTC"
## [11] "2016-09-09 15:33:00 UTC"
## 
## Vector of Ends
##  [1] "2016-09-07 02:57:00 UTC" "2016-09-07 13:19:00 UTC"
##  [3] "2016-09-07 14:57:00 UTC" "2016-09-07 15:44:00 UTC"
##  [5] "2016-09-07 17:11:00 UTC" "2016-09-09 01:01:00 UTC"
##  [7] "2016-09-09 05:53:00 UTC" "2016-09-09 09:31:00 UTC"
##  [9] "2016-09-09 12:39:00 UTC" "2016-09-09 15:57:00 UTC"
## [11] "2016-09-10 06:10:00 UTC"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  5875 2515
##       TRUE      0  427
## 
## Mismatch at time 2016-09-06 08:53 UTC
## [1] "KMSP 060653Z 21010G17KT 10SM BKN050 BKN060 OVC090 27/22 A2983 RMK AO2 SLP097 T02720217"                                                                                             
## [2] "KMSP 060753Z 22007G17KT 6SM -TSRA FEW023 BKN049CB OVC095 26/22 A2982 RMK AO2 PK WND 22027/0742 LTG DSNT NE AND S SLP092 TSB35 OCNL LTGICCG OHD TS OHD MOV NE P0015 T02560222"       
## [3] "KMSP 060853Z 24003KT 8SM TS BKN049CB BKN060 OVC080 25/22 A2981 RMK AO2 LTG DSNT ALQDS SLP088 RAB0744E01B17 TSB0735E12B22 OCNL LTGICCG OHD TS OHD MOV NE P0001 60016 T02500222 58008"
## [4] "KMSP 060953Z 25007KT 4SM -TSRA BR SCT027CB BKN041 OVC080 24/22 A2981 RMK AO2 LTG DSNT ALQDS RAB34 TSE01B13E28B38 SLP089 FRQ LTGICCG OHD TS OHD MOV NE P0020 T02390222"              
## [5] "KMSP 061053Z 18003KT 5SM -TSRA BR FEW029 BKN045CB OVC065 24/22 A2981 RMK AO2 LTG DSNT ALQDS RAE01B14E28B46 TSE02B12E36B41 SLP088 OCNL LTGICCG N-E VCTS N-E MOV NE P0005 T02440222"  
## 
## Mismatch at time 2016-09-06 14:53 UTC
## [1] "KMSP 061253Z 00000KT 9SM -RA FEW007 SCT015 BKN100 OVC150 22/21 A2989 RMK AO2 LTG DSNT SE RAE12B37 SLP116 P0004 T02220206"     
## [2] "KMSP 061353Z 21007KT 8SM -RA FEW010 OVC090 22/21 A2996 RMK AO2 RAE00B21 SLP140 P0007 T02220206"                               
## [3] "KMSP 061453Z 08007KT 10SM FEW005 FEW110 BKN150 BKN250 22/21 A2989 RMK AO2 LTG DSNT W RAE47 SLP116 P0006 60017 T02220206 58006"
## [4] "KMSP 061553Z VRB05KT 10SM -RA SCT009 SCT100 BKN120 BKN150 22/21 A2992 RMK AO2 LTG DSNT SW RAB13 SLP125 P0003 T02220206"       
## [5] "KMSP 061653Z 15004KT 10SM FEW110 BKN150 BKN250 22/19 A2992 RMK AO2 LTG DSNT E RAE40 SLP127 P0002 T02220194"                   
## 
## Mismatch at time 2016-09-06 16:53 UTC
## [1] "KMSP 061453Z 08007KT 10SM FEW005 FEW110 BKN150 BKN250 22/21 A2989 RMK AO2 LTG DSNT W RAE47 SLP116 P0006 60017 T02220206 58006"       
## [2] "KMSP 061553Z VRB05KT 10SM -RA SCT009 SCT100 BKN120 BKN150 22/21 A2992 RMK AO2 LTG DSNT SW RAB13 SLP125 P0003 T02220206"              
## [3] "KMSP 061653Z 15004KT 10SM FEW110 BKN150 BKN250 22/19 A2992 RMK AO2 LTG DSNT E RAE40 SLP127 P0002 T02220194"                          
## [4] "KMSP 061753Z 20004KT 10SM FEW010 FEW150 BKN180 BKN250 22/19 A2994 RMK AO2 LTG DSNT SE AND S SLP132 60022 T02220194 10228 20217 51016"
## [5] "KMSP 061853Z 21004KT 10SM SCT011 BKN033 BKN150 BKN250 24/20 A2995 RMK AO2 SLP137 T02390200"                                          
## 
## Mismatch at time 2016-09-06 17:53 UTC
## [1] "KMSP 061553Z VRB05KT 10SM -RA SCT009 SCT100 BKN120 BKN150 22/21 A2992 RMK AO2 LTG DSNT SW RAB13 SLP125 P0003 T02220206"              
## [2] "KMSP 061653Z 15004KT 10SM FEW110 BKN150 BKN250 22/19 A2992 RMK AO2 LTG DSNT E RAE40 SLP127 P0002 T02220194"                          
## [3] "KMSP 061753Z 20004KT 10SM FEW010 FEW150 BKN180 BKN250 22/19 A2994 RMK AO2 LTG DSNT SE AND S SLP132 60022 T02220194 10228 20217 51016"
## [4] "KMSP 061853Z 21004KT 10SM SCT011 BKN033 BKN150 BKN250 24/20 A2995 RMK AO2 SLP137 T02390200"                                          
## [5] "KMSP 061953Z 23005KT 10SM FEW011 BKN036 BKN070 BKN250 24/20 A2996 RMK AO2 RAB20E49 SLP140 P0000 T02440200"                           
## 
## Mismatch at time 2016-09-06 18:53 UTC
## [1] "KMSP 061653Z 15004KT 10SM FEW110 BKN150 BKN250 22/19 A2992 RMK AO2 LTG DSNT E RAE40 SLP127 P0002 T02220194"                          
## [2] "KMSP 061753Z 20004KT 10SM FEW010 FEW150 BKN180 BKN250 22/19 A2994 RMK AO2 LTG DSNT SE AND S SLP132 60022 T02220194 10228 20217 51016"
## [3] "KMSP 061853Z 21004KT 10SM SCT011 BKN033 BKN150 BKN250 24/20 A2995 RMK AO2 SLP137 T02390200"                                          
## [4] "KMSP 061953Z 23005KT 10SM FEW011 BKN036 BKN070 BKN250 24/20 A2996 RMK AO2 RAB20E49 SLP140 P0000 T02440200"                           
## [5] "KMSP 062053Z 19003KT 10SM FEW018 BKN041 BKN130 BKN250 25/19 A2994 RMK AO2 LTG DSNT SE AND S SLP134 60000 T02500189 50002"
testMSP <- tibble::tibble(beginTime=tmpMSP$testAllBegins, 
                          endTime=c(tmpMSP$testAllEnds, NA),  # make the same length as begins
                          ints=tmpMSP$testIntervals
                          )

testMSP %>%
    filter(ints >= 2000 | lubridate::date(beginTime) == "2016-09-06") %>%
    as.data.frame()
##              beginTime             endTime       ints
## 1  2016-09-06 07:53:00 2016-09-06 08:01:00     8 mins
## 2  2016-09-06 08:17:00 2016-09-06 10:01:00   104 mins
## 3  2016-09-06 09:34:00 2016-09-06 10:28:00    54 mins
## 4  2016-09-06 10:14:00 2016-09-06 11:01:00    47 mins
## 5  2016-09-06 10:46:00 2016-09-06 12:12:00    86 mins
## 6  2016-09-06 11:16:00 2016-09-06 13:00:00   104 mins
## 7  2016-09-06 12:37:00 2016-09-06 14:47:00   130 mins
## 8  2016-09-06 13:21:00 2016-09-06 16:40:00   199 mins
## 9  2016-09-06 15:13:00 2016-09-06 19:49:00   276 mins
## 10 2016-09-06 19:20:00 2016-09-07 01:25:00   365 mins
## 11 2016-09-10 05:46:00 2016-09-13 23:51:00  5405 mins
## 12 2016-09-14 01:09:00 2016-09-16 02:19:00  2950 mins
## 13 2016-09-17 01:07:00 2016-09-19 16:05:00  3778 mins
## 14 2016-09-19 14:58:00 2016-09-21 05:04:00  2286 mins
## 15 2016-09-22 11:01:00 2016-09-23 23:46:00  2205 mins
## 16 2016-09-25 22:14:00 2016-09-28 00:06:00  2992 mins
## 17 2016-09-27 21:33:00 2016-10-04 15:07:00  9694 mins
## 18 2016-10-05 01:09:00 2016-10-06 22:52:00  2743 mins
## 19 2016-10-07 06:46:00 2016-10-12 04:37:00  7071 mins
## 20 2016-10-12 10:59:00 2016-10-15 22:57:00  5038 mins
## 21 2016-10-18 05:48:00 2016-10-26 13:17:00 11969 mins
## 22 2016-10-26 16:18:00 2016-10-29 17:11:00  4373 mins
## 23 2016-10-29 22:48:00 2016-11-18 17:06:00 28458 mins
## 24 2016-11-18 18:13:00 2016-11-22 13:53:00  5500 mins
## 25 2016-11-22 22:53:00 2016-11-25 00:21:00  2968 mins
## 26 2016-11-24 22:49:00 2016-11-28 05:16:00  4707 mins
## 27 2016-12-01 07:20:00 2016-12-05 18:28:00  6428 mins
## 28 2016-12-06 00:35:00 2016-12-23 17:45:00 25510 mins
## 29 2016-12-23 17:02:00 2016-12-25 22:35:00  3213 mins

The Minneapolis data appear to first become misaligned during a September 6 rain event.

  • 2016-09-06 0653 no rain
  • 2016-09-06 0753 -TSRA TSB35 (algorithm correctly flags “needs begin” for 0753)
  • 2016-09-06 0853 TS RAB0744E01B17 (algorithm correctly flags as “over-begin” but guard logic needs to be updated to delete both of the beginning times since net-end should be 1 for -TSRA becoming TS)
  • Every following record will be erroneous, including overlapping intervals, since the extra begin record is kept
  • Additionally, the RAB0744 should be in the 0753 observation (same issue as with Houston)

Next steps are to investigate and fix the following:

  • Remove any RAB/RAE observations that do not “belong” to the hour that falls under the current METAR
  • Check whether overBegin or overEnd have 2+ causes, and upgrade the logic to delete all if needed
  • Select the correct RAB to discard when RABnnEnnBnn is observed where there should be no net change (same guard issue for RAE)
  • Upgrade the logic to manage a start state of precipitation (create RAB at the beginning so the interval overlaps) or an end state of precipitation (create an RAE at the end to avoid recycling vectors)

The functions that need updating are suggestBeginEndTimes and convertExceptionsToTimes and getBeginEndTimeMatrix:

# Updated function to account for:
# 1. Remove any RAB/RAE observations that have dddd that 'belongs' to a different hourly METAR
# 2. Check whether overBegin or overEnd have 2+ causes, and delete all that are necessary
# 3. Upgrade logic to manage a starting state of precipitation, or an ending state of precipitation
# 4. Select the correct RAB/RAE to discard in cases like RABnnEnnBnn where net change should be 0


# Helper function to extract the beginning and ending times using str_extract_all
# Updated to check that four-digit times are within the proper hour
getBeginEndTimeMatrix <- function(file, 
                                  pullVar="precipData", 
                                  pState="B", 
                                  timeVar=NULL
                                  ) {
    
    # FUNCTION ARGUMENTS:
    # file: file containing data
    # pullVar: the variable of interest in file
    # pState: the precipitation state of interest - B (begin) or E (end)
    # timeVar: the datetime variable in file which can be used to check consistency (NULL means no checks)
    
    # The basic extraction process
    mtxBET <- file %>%
        pull(pullVar) %>%
        str_extract_all(paste0(pState, "\\d+"), simplify=TRUE) %>%
        as.data.frame(stringsAsFactors=FALSE) %>%
        tibble::as_tibble()
    
    # If time variable has been passed, check for consistency of extracted times
    if (!is.null(timeVar)) {
        chk <- mtxBET %>%
            mutate(time=pull(file, timeVar)) %>%
            pivot_longer(-time, names_to="column", values_to="value") %>%
            mutate(convVal=str_replace(value, pattern="B", replacement=""), 
                   keyMin=lubridate::minute(time), 
                   delta=ifelse(str_length(convVal)==4, as.integer(str_sub(convVal, 3, -1))-keyMin, NA)
                   )
        
        issues <- chk %>%
            filter(delta <= 0)
        
        if (nrow(issues) > 0) {
            
            # Flag the issue
            cat("\nMETAR has begin or end time belonging to a different record - will be deleted\n")
            print(issues)
            
            # Delete the issue
            chk <- chk %>%
                filter(is.na(delta) | delta > 0)
            
            # Fix the V1-Vn due to deletions
            chk <- chk %>%
                group_by(time) %>%
                mutate(column=paste0("V", row_number())) %>%
                ungroup()
            
            # Update mtxBET
            mtxBET <- chk %>%
                select(time, column, value) %>%
                pivot_wider(names_from=column, values_from=value) %>%
                select(-time)
        }
    }
    
    # Return the matrix (converted to data frame)
    mtxBET
    
}


# Update to pull any cases of 2+ overBegin or 2+ overEnd
convertExceptionsToTimes <- function(excFile, maxCol=3) {
    
    # Basic beginAdd (just add the METAR time)
    beginAdd <- excFile %>%
        filter(needBegin) %>%
        pull(dtime) %>%
        timeFormatter()
    
    # Basic endAdd (just add the METAR time)
    endAdd <- excFile %>%
        filter(needEnd) %>%
        pull(dtime) %>%
        timeFormatter()
    
    # Basic endExclude (just delete the earliest end time - maybe add a guard check later)
    endExclude <- excFile %>%
        filter(overEnd) %>%
        select_at(vars(all_of(c("dtime", "nOverEnd", paste0("e", 1:maxCol))))) %>%
        pivot_longer(-c(dtime, nOverEnd), names_to="column", values_to="time") %>%
        filter(as.integer(str_replace(column, pattern="e", replacement="")) <= nOverEnd) %>%
        pull(time)

    # Basic beginExclude (just delete the earliest begin time - maybe add a guard check later)
    beginExclude <- excFile %>%
        filter(overBegin) %>%
        select_at(vars(all_of(c("dtime", "nOverBegin", paste0("b", 1:maxCol))))) %>%
        pivot_longer(-c(dtime, nOverBegin), names_to="column", values_to="time") %>%
        filter(as.integer(str_replace(column, pattern="b", replacement="")) <= nOverBegin) %>%
        pull(time)
    
    list(endAdd=endAdd, beginAdd=beginAdd, endExclude=endExclude, beginExclude=beginExclude)
    
}



# Updated to pull out the number of overBegin and overEnd
suggestBeginEndTimes <- function(df, 
                                 regMatch, 
                                 nColMax=3, 
                                 printNACols=FALSE, 
                                 printIssues=TRUE
                                 ) {
    
    # FUNCTION ARGUMENTS
    # df: the data frame to be examined
    # regMatch: the relevant regex extraction code
    # nColMax: the maximum number of columns allowed for precipitation begins and precipitation ends
    # printNACols: boolean, whether to print the NA column sums
    # printIssues: boolean, whether to print out all of the issues
    
    # Pull the data and check for the specified precipitation pattern and lags
    # Needs to follow the \\d{6}Z that shows the time and precede the RMK that denotes the remarks
    sugStates <- df %>%
        select(dtime, origMETAR) %>%
        mutate(curPrecip=str_detect(origMETAR, paste0("\\d{6}Z.*", regMatch, ".*RMK")), 
               lagPrecip=lag(curPrecip, 1)
        )
    
    # Use the analysis data to look for begins and ends flagged in the remarks
    sugBE <- extractPrecipData(sugStates, pType=regMatch)
    
    # Inner join the data by dtime
    sugBEJoin <- sugBE %>% 
        select(dtime, precipData, chgPrecip=isPrecip, dateUTC, hourUTC)
    sugStates <- sugStates %>%
        inner_join(sugBEJoin, by="dtime")
    
    # Get the beginning and end times data for the desired precipitation type
    sugBegin <- getBeginEndTimeMatrix(sugStates, pState="B", timeVar="dtime")
    sugEnd <- getBeginEndTimeMatrix(sugStates, pState="E", timeVar="dtime")
    
    # Ensure that user-defined maximum number of columns is not exceeded
    if (ncol(sugBegin) > nColMax | ncol(sugEnd) > nColMax) { 
        stop("Function argument set to allow only for 0 - ", nColMax, "columns, investigate and fix")
    }
    
    # For each iteration, create testBTn and testETn (use NA if no column)
    for (ctr in 1:nColMax) {
        
        # Create NA as the baseline
        assign(paste0("testBT", ctr), NA)
        assign(paste0("testET", ctr), NA)
        
        # Assign begin if exists
        if (ncol(sugBegin) >= ctr) {
            bt <- getBeginEndTimeVector(sugBegin, sugStates, extractVar=paste0("V", ctr), extractSym="B")
            assign(paste0("testBT", ctr), bt)
        }
        
        # Assign end if exists
        if (ncol(sugEnd) >= ctr) {
            et <- getBeginEndTimeVector(sugEnd, sugStates, extractVar=paste0("V", ctr), extractSym="E")
            assign(paste0("testET", ctr), et)
        }
        
    }
    
    # Integrate to a single file, allowing for nMaxCol applications of variables
    # Create file sugExceptions as sugStates and initialize the columns for begins and ends
    sugExceptions <- sugStates %>%
        mutate(begins=0, ends=0)
    
    # Loop through up to nColMax, adding the appropriate bn=testBTn and en=testETn
    for (ctr in 1:nColMax) {
        sugExceptions <- sugExceptions %>%
            mutate(!!paste0("b", ctr) := get(paste0("testBT", ctr)), 
                   !!paste0("e", ctr) := get(paste0("testET", ctr)), 
                   begins=begins + ifelse(is.na(get(paste0("b", ctr))), 0, 1),
                   ends=ends + ifelse(is.na(get(paste0("e", ctr))), 0, 1)
            )
    }
    
    # Create the final version by checking key criteria
    # Add a criteria for b before e if previous; and e before b if not previous?
    # Potential issues include:
    # Current and lagged state are the same when begins != ends
    # Current is no, lagged is yes, ends <= begins
    # Current is yes, lagged is no, begins <= ends
    sugExceptions <- sugExceptions %>%
        mutate(issueA=(lagPrecip==curPrecip & begins!=ends), # no state change should have begins==ends
               issueB=(lagPrecip & !curPrecip & ends <= begins),  # change to 'no' should have ends>bsgins
               issueC=(curPrecip & !lagPrecip & begins <= ends),  # change to 'yse' should have begins>ends
               issueD=abs(begins-ends) > 1,  # state is a 1/0 so net change should be -1, 0, or 1
               netBeginEndShouldBe=(curPrecip - lagPrecip),
               netBeginEndIs=(begins - ends),
               needBegin=issueC & begins==0,
               needEnd=issueB & ends==0,
               overBegin=(issueA & (begins > ends)) | (issueB & (ends > 0)),
               overEnd=(issueA & (ends > begins)) | (issueC & (begins > 0)),
               nOverBegin=pmax(0, netBeginEndIs-netBeginEndShouldBe-needEnd),
               nOverEnd=pmax(0, netBeginEndShouldBe-netBeginEndIs-needBegin),
               tooManyBE=issueD
        )
    
    # Report on the NA status by column
    if (printNACols) {
        colSums(is.na(sugExceptions)) %>% 
            print()
    }
    
    # Print out the issue if requested
    if (printIssues) {
        printBeginEndIssues(sugExceptions, nColMax=nColMax)
    }
    
    sugExceptions %>%
        filter(issueA | issueB | issueC | issueD)
    
}

With the logic updated to correct for times in the wrong METAR as well as 2+ occurrences of overBegin or overEnd, the process is re-run for Minneapolis:

tmpMSP2 <- combinedPrecipitation(kmsp_2016, 
                                 pType="(?<!FZ)RA",
                                 frameName="kmsp_2016", 
                                 maxConsistencyPrint = 5
                                 )
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*) 
## 
## 
## METAR has begin or end time belonging to a different record - will be deleted
## # A tibble: 1 x 6
##   time                column value convVal keyMin delta
##   <dttm>              <chr>  <chr> <chr>    <int> <int>
## 1 2016-09-06 08:53:00 V1     B0744 0744        53    -9
## 
## Need Begin time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-09-06 07:53:00 KMSP 060753Z 22007G17KT 6SM -TSRA FEW023 BKN049CB OVC095 ~
## 
## Need End time
## # A tibble: 1 x 2
##   dtime               origMETAR                                                 
##   <dttm>              <chr>                                                     
## 1 2016-11-22 13:53:00 KMSP 221353Z 13018KT 10SM -PL OVC055 01/M06 A3015 RMK AO2~
## 
## Extraneous Begin time
## # A tibble: 4 x 4
##   dtime               b1              b2    b3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-05-29 19:53:00 2016-05-29 1931 <NA>  NA   
## 2 2016-06-20 04:53:00 2016-06-20 0411 <NA>  NA   
## 3 2016-09-06 08:53:00 2016-09-06 0817 <NA>  NA   
## 4 2016-11-29 07:53:00 2016-11-29 0718 <NA>  NA   
## 
## Extraneous End time
## # A tibble: 2 x 4
##   dtime               e1              e2    e3   
##   <dttm>              <chr>           <chr> <lgl>
## 1 2016-04-05 15:53:00 2016-04-05 1522 <NA>  NA   
## 2 2016-10-18 04:53:00 2016-10-18 0435 <NA>  NA   
## 
## Wrong amount of begins or ends
## # A tibble: 0 x 7
## # ... with 7 variables: dtime <dttm>, e1 <chr>, e2 <chr>, e3 <lgl>, b1 <chr>,
## #   b2 <chr>, b3 <lgl>
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*)
## Warning in unclass(time1) - unclass(time2): longer object length is not a
## multiple of shorter object length
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -476532.0      18.0      67.5    -817.9     188.0   28458.0 
## 
## Problem Detected - Intervals are not positive.  Data to help investigate
## 
## Vector of Begins
## [1] "2016-12-05 18:06:00 UTC" "2016-12-05 18:50:00 UTC"
## [3] "2016-12-06 00:35:00 UTC" "2016-12-23 17:02:00 UTC"
## [5] "2016-12-25 17:59:00 UTC"
## 
## Vector of Ends
## [1] "2016-12-05 18:59:00 UTC" "2016-12-06 00:50:00 UTC"
## [3] "2016-12-23 17:45:00 UTC" "2016-12-25 22:35:00 UTC"
## [5] "2016-12-26 05:07:00 UTC"
## 
## 
## Potential problem Detected - very long.  Data to help investigate
## 
## Positions with problems are: 260 266 268 269 270 271 278 280 288 289 294 298 303 304 312 314 319 321 327 328 350 353 354
## Vector of Begins
##  [1] "2016-09-07 01:55:00 UTC" "2016-09-07 02:22:00 UTC"
##  [3] "2016-09-07 12:44:00 UTC" "2016-09-07 14:12:00 UTC"
##  [5] "2016-09-07 15:08:00 UTC" "2016-09-07 17:02:00 UTC"
##  [7] "2016-09-09 00:52:00 UTC" "2016-09-09 05:36:00 UTC"
##  [9] "2016-09-09 07:51:00 UTC" "2016-09-09 11:49:00 UTC"
## [11] "2016-09-09 15:33:00 UTC"
## 
## Vector of Ends
##  [1] "2016-09-07 02:57:00 UTC" "2016-09-07 13:19:00 UTC"
##  [3] "2016-09-07 14:57:00 UTC" "2016-09-07 15:44:00 UTC"
##  [5] "2016-09-07 17:11:00 UTC" "2016-09-09 01:01:00 UTC"
##  [7] "2016-09-09 05:53:00 UTC" "2016-09-09 09:31:00 UTC"
##  [9] "2016-09-09 12:39:00 UTC" "2016-09-09 15:57:00 UTC"
## [11] "2016-09-10 06:10:00 UTC"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  5875 2515
##       TRUE      0  427
## 
## Mismatch at time 2016-09-06 08:53 UTC
## [1] "KMSP 060653Z 21010G17KT 10SM BKN050 BKN060 OVC090 27/22 A2983 RMK AO2 SLP097 T02720217"                                                                                             
## [2] "KMSP 060753Z 22007G17KT 6SM -TSRA FEW023 BKN049CB OVC095 26/22 A2982 RMK AO2 PK WND 22027/0742 LTG DSNT NE AND S SLP092 TSB35 OCNL LTGICCG OHD TS OHD MOV NE P0015 T02560222"       
## [3] "KMSP 060853Z 24003KT 8SM TS BKN049CB BKN060 OVC080 25/22 A2981 RMK AO2 LTG DSNT ALQDS SLP088 RAB0744E01B17 TSB0735E12B22 OCNL LTGICCG OHD TS OHD MOV NE P0001 60016 T02500222 58008"
## [4] "KMSP 060953Z 25007KT 4SM -TSRA BR SCT027CB BKN041 OVC080 24/22 A2981 RMK AO2 LTG DSNT ALQDS RAB34 TSE01B13E28B38 SLP089 FRQ LTGICCG OHD TS OHD MOV NE P0020 T02390222"              
## [5] "KMSP 061053Z 18003KT 5SM -TSRA BR FEW029 BKN045CB OVC065 24/22 A2981 RMK AO2 LTG DSNT ALQDS RAE01B14E28B46 TSE02B12E36B41 SLP088 OCNL LTGICCG N-E VCTS N-E MOV NE P0005 T02440222"  
## 
## Mismatch at time 2016-09-06 14:53 UTC
## [1] "KMSP 061253Z 00000KT 9SM -RA FEW007 SCT015 BKN100 OVC150 22/21 A2989 RMK AO2 LTG DSNT SE RAE12B37 SLP116 P0004 T02220206"     
## [2] "KMSP 061353Z 21007KT 8SM -RA FEW010 OVC090 22/21 A2996 RMK AO2 RAE00B21 SLP140 P0007 T02220206"                               
## [3] "KMSP 061453Z 08007KT 10SM FEW005 FEW110 BKN150 BKN250 22/21 A2989 RMK AO2 LTG DSNT W RAE47 SLP116 P0006 60017 T02220206 58006"
## [4] "KMSP 061553Z VRB05KT 10SM -RA SCT009 SCT100 BKN120 BKN150 22/21 A2992 RMK AO2 LTG DSNT SW RAB13 SLP125 P0003 T02220206"       
## [5] "KMSP 061653Z 15004KT 10SM FEW110 BKN150 BKN250 22/19 A2992 RMK AO2 LTG DSNT E RAE40 SLP127 P0002 T02220194"                   
## 
## Mismatch at time 2016-09-06 16:53 UTC
## [1] "KMSP 061453Z 08007KT 10SM FEW005 FEW110 BKN150 BKN250 22/21 A2989 RMK AO2 LTG DSNT W RAE47 SLP116 P0006 60017 T02220206 58006"       
## [2] "KMSP 061553Z VRB05KT 10SM -RA SCT009 SCT100 BKN120 BKN150 22/21 A2992 RMK AO2 LTG DSNT SW RAB13 SLP125 P0003 T02220206"              
## [3] "KMSP 061653Z 15004KT 10SM FEW110 BKN150 BKN250 22/19 A2992 RMK AO2 LTG DSNT E RAE40 SLP127 P0002 T02220194"                          
## [4] "KMSP 061753Z 20004KT 10SM FEW010 FEW150 BKN180 BKN250 22/19 A2994 RMK AO2 LTG DSNT SE AND S SLP132 60022 T02220194 10228 20217 51016"
## [5] "KMSP 061853Z 21004KT 10SM SCT011 BKN033 BKN150 BKN250 24/20 A2995 RMK AO2 SLP137 T02390200"                                          
## 
## Mismatch at time 2016-09-06 17:53 UTC
## [1] "KMSP 061553Z VRB05KT 10SM -RA SCT009 SCT100 BKN120 BKN150 22/21 A2992 RMK AO2 LTG DSNT SW RAB13 SLP125 P0003 T02220206"              
## [2] "KMSP 061653Z 15004KT 10SM FEW110 BKN150 BKN250 22/19 A2992 RMK AO2 LTG DSNT E RAE40 SLP127 P0002 T02220194"                          
## [3] "KMSP 061753Z 20004KT 10SM FEW010 FEW150 BKN180 BKN250 22/19 A2994 RMK AO2 LTG DSNT SE AND S SLP132 60022 T02220194 10228 20217 51016"
## [4] "KMSP 061853Z 21004KT 10SM SCT011 BKN033 BKN150 BKN250 24/20 A2995 RMK AO2 SLP137 T02390200"                                          
## [5] "KMSP 061953Z 23005KT 10SM FEW011 BKN036 BKN070 BKN250 24/20 A2996 RMK AO2 RAB20E49 SLP140 P0000 T02440200"                           
## 
## Mismatch at time 2016-09-06 18:53 UTC
## [1] "KMSP 061653Z 15004KT 10SM FEW110 BKN150 BKN250 22/19 A2992 RMK AO2 LTG DSNT E RAE40 SLP127 P0002 T02220194"                          
## [2] "KMSP 061753Z 20004KT 10SM FEW010 FEW150 BKN180 BKN250 22/19 A2994 RMK AO2 LTG DSNT SE AND S SLP132 60022 T02220194 10228 20217 51016"
## [3] "KMSP 061853Z 21004KT 10SM SCT011 BKN033 BKN150 BKN250 24/20 A2995 RMK AO2 SLP137 T02390200"                                          
## [4] "KMSP 061953Z 23005KT 10SM FEW011 BKN036 BKN070 BKN250 24/20 A2996 RMK AO2 RAB20E49 SLP140 P0000 T02440200"                           
## [5] "KMSP 062053Z 19003KT 10SM FEW018 BKN041 BKN130 BKN250 25/19 A2994 RMK AO2 LTG DSNT SE AND S SLP134 60000 T02500189 50002"
tmpMSP2$testAllBegins[lubridate::date(tmpMSP2$testAllBegins)=="2016-09-06"]
##  [1] "2016-09-06 07:44:00 UTC" "2016-09-06 07:53:00 UTC"
##  [3] "2016-09-06 09:34:00 UTC" "2016-09-06 10:14:00 UTC"
##  [5] "2016-09-06 10:46:00 UTC" "2016-09-06 11:16:00 UTC"
##  [7] "2016-09-06 12:37:00 UTC" "2016-09-06 13:21:00 UTC"
##  [9] "2016-09-06 15:13:00 UTC" "2016-09-06 19:20:00 UTC"
tmpMSP2$testAllEnds[lubridate::date(tmpMSP2$testAllEnds)=="2016-09-06"]
## [1] "2016-09-06 08:01:00 UTC" "2016-09-06 10:01:00 UTC"
## [3] "2016-09-06 10:28:00 UTC" "2016-09-06 11:01:00 UTC"
## [5] "2016-09-06 12:12:00 UTC" "2016-09-06 13:00:00 UTC"
## [7] "2016-09-06 14:47:00 UTC" "2016-09-06 16:40:00 UTC"
## [9] "2016-09-06 19:49:00 UTC"

The process is correctly noting that it wants to delete the erroneous 0744 record, though it remains in the data causing the same mismatch errors.

The issue appears to be driven by getBeginEndTimeVector, which will be the next function to update.

A revisit to the precipitation extraction seems in order. Broadly, the extraction process is designed to get two things:

  • Precipitation state (yes/no by type) for each entry in the METAR
  • Precipitation ranges (start and end times for each interval of precipitation type)

Extracting the precipitation state is simple, and is based on ddddddZ…pType…RMK occurring somewhere in the METAR.

Extracting the data for constructing the ranges is based on RMK…pTypeBnnEnnBnn… type records. There is a need to convert the range data to date-time, to check that the ranges are consistent with the extracted precipitation states, and to make a best guess change to the range data to drive the alignment.

This suggests three functions:

  • Function #1: Extract the precipitation state and precipitation range data for each entry
  • Function #2: Convert ranges data to date-time
  • Function #3: Correct for any inconsistencies in ranges date-time and METAR precipitation states

Example code includes:

# Function to extract precipitation state and range data from METAR
fnPrecip1 <- function(df, pType, showRegex=TRUE) {

    # The remarks pattern is created based on the precipitation type
    keyPattern <- paste0("(", pType, "[B|E]\\d+[0-9BE]*)")
    if (showRegex) { cat("\nRegex search code is:", keyPattern, "\n") }
    
    # Extract the current precipitation state, lagged precipitation state, and range data from METAR
    df <- df %>% 
        select(origMETAR, dtime) %>% 
        mutate(strPrecip=str_extract(origMETAR, pattern=keyPattern), 
               curPrecip=str_detect(origMETAR, pattern=paste0("\\d{6}Z.*", pType, ".*RMK")),
               lagPrecip=lag(curPrecip, 1)
               )
    
    # Confirm that df is unique by dtime (algorithm depends on this)
    dups <- df %>%
        select(dtime) %>%
        duplicated() %>%
        any()
    if (dups) {
        stop("\nThere are duplicate values of dtime - investigate and fix\n")
    }
    
    # Returnt the file
    df
    
}

msp_precip1 <- fnPrecip1(kmsp_2016, pType="(?<!FZ)RA")
## 
## Regex search code is: ((?<!FZ)RA[B|E]\d+[0-9BE]*)
msp_precip1
## # A tibble: 8,817 x 5
##    origMETAR                   dtime               strPrecip curPrecip lagPrecip
##    <chr>                       <dttm>              <chr>     <lgl>     <lgl>    
##  1 KMSP 310053Z 27008KT 4SM -~ 2015-12-31 00:53:00 <NA>      FALSE     NA       
##  2 KMSP 310153Z 23011KT 5SM -~ 2015-12-31 01:53:00 <NA>      FALSE     FALSE    
##  3 KMSP 310253Z 25007KT 9SM -~ 2015-12-31 02:53:00 <NA>      FALSE     FALSE    
##  4 KMSP 310353Z 27008KT 10SM ~ 2015-12-31 03:53:00 <NA>      FALSE     FALSE    
##  5 KMSP 310453Z 26007KT 10SM ~ 2015-12-31 04:53:00 <NA>      FALSE     FALSE    
##  6 KMSP 310553Z 23004KT 9SM F~ 2015-12-31 05:53:00 <NA>      FALSE     FALSE    
##  7 KMSP 310653Z 22003KT 8SM O~ 2015-12-31 06:53:00 <NA>      FALSE     FALSE    
##  8 KMSP 310753Z 23005KT 9SM O~ 2015-12-31 07:53:00 <NA>      FALSE     FALSE    
##  9 KMSP 310853Z 22006KT 9SM S~ 2015-12-31 08:53:00 <NA>      FALSE     FALSE    
## 10 KMSP 310953Z 23006KT 9SM C~ 2015-12-31 09:53:00 <NA>      FALSE     FALSE    
## # ... with 8,807 more rows

The next function splits out the B/E data in strPrecip and creates times for each:

# Helper function to zero-pad
zeroPad <- function(x, width=2, side="left", pad="0") {
    
    str_pad(x, width=width, side=side, pad=pad)
    
}

# Helper function to take a date-time and a minutes and create a new date-time
helperBEDateTime <- function(dt, mins) {
    
    # Create the date as character (lubridate and dplyr do not always work well together)
    dateChar <- ifelse(str_length(mins)==4 & str_sub(mins, 1, 2)=="23", 
                       as.character(lubridate::as_date(lubridate::date(dt)-1)), 
                       as.character(lubridate::date(dt))
                       )
    
    # Create the hours and minutes as character
    hrMinChar <- ifelse(str_length(mins)==4, 
                        mins, 
                        paste0(zeroPad(lubridate::hour(dt)), mins)
                        )
    
    # Return the appropriate date-time
    lubridate::ymd_hm(paste0(dateChar, " ", hrMinChar))

}

# Function to create times for each piece of B/E data
fnPrecip2 <- function(df) {

    # Create the begin and end times by splitting precipString, then format as tibble and reattach dtime
    bEData <- df %>%
        pull(strPrecip) %>%
        str_extract_all(pattern="[BE]\\d+", simplify=TRUE) %>%
        as.data.frame(stringsAsFactors=FALSE) %>%
        tibble::as_tibble() %>%
        bind_cols(select(df, dtime))
    
    # Pivot longer, create a dummy record where is.na(V1), and split in to time and state change
    bEData <- bEData %>%
        pivot_longer(-dtime) %>%
        mutate(value=ifelse(name=="V1" & is.na(value), paste0("N", lubridate::minute(dtime)), value)) %>%
        filter(value != "") %>%
        mutate(chgType=str_sub(value, 1, 1), 
               chgNum=str_sub(value, 2, -1), 
               chgTime=helperBEDateTime(dt=dtime, mins=chgNum)
               )
    
    # Return the data
    bEData
    
}

msp_precip2 <- fnPrecip2(msp_precip1)
msp_precip2
## # A tibble: 9,017 x 6
##    dtime               name  value chgType chgNum chgTime            
##    <dttm>              <chr> <chr> <chr>   <chr>  <dttm>             
##  1 2015-12-31 00:53:00 V1    N53   N       53     2015-12-31 00:53:00
##  2 2015-12-31 01:53:00 V1    N53   N       53     2015-12-31 01:53:00
##  3 2015-12-31 02:53:00 V1    N53   N       53     2015-12-31 02:53:00
##  4 2015-12-31 03:53:00 V1    N53   N       53     2015-12-31 03:53:00
##  5 2015-12-31 04:53:00 V1    N53   N       53     2015-12-31 04:53:00
##  6 2015-12-31 05:53:00 V1    N53   N       53     2015-12-31 05:53:00
##  7 2015-12-31 06:53:00 V1    N53   N       53     2015-12-31 06:53:00
##  8 2015-12-31 07:53:00 V1    N53   N       53     2015-12-31 07:53:00
##  9 2015-12-31 08:53:00 V1    N53   N       53     2015-12-31 08:53:00
## 10 2015-12-31 09:53:00 V1    N53   N       53     2015-12-31 09:53:00
## # ... with 9,007 more rows

Then a function can be written to extract issues where the times are out of sequence:

# Function to extract issues - continuity, distance from METAR
fnPrecip3 <- function(df) {

    issueCheck <- df %>%
        mutate(deltaMETAR=dtime-chgTime, 
               deltaPrev=ifelse(row_number()==1, 3600, chgTime-lag(chgTime, 1)), 
               issueCons=(deltaMETAR < 0 | deltaMETAR > 3599 | deltaPrev <= 0)
               )
    
    cat("\nContinuity or consistency error - record(s) will be deleted\n")
    issueCheck %>%
        filter(issueCons) %>%
        print()
    
    issueCheck %>%
        select(-deltaMETAR, -deltaPrev) %>%
        filter(!issueCons)
    
}

msp_precip3 <- fnPrecip3(msp_precip2)
## 
## Continuity or consistency error - record(s) will be deleted
## # A tibble: 1 x 9
##   dtime               name  value chgType chgNum chgTime             deltaMETAR
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <drtn>    
## 1 2016-09-06 08:53:00 V1    B0744 B       0744   2016-09-06 07:44:00 4140 secs 
## # ... with 2 more variables: deltaPrev <dbl>, issueCons <lgl>
msp_precip3
## # A tibble: 9,016 x 7
##    dtime               name  value chgType chgNum chgTime             issueCons
##    <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>    
##  1 2015-12-31 00:53:00 V1    N53   N       53     2015-12-31 00:53:00 FALSE    
##  2 2015-12-31 01:53:00 V1    N53   N       53     2015-12-31 01:53:00 FALSE    
##  3 2015-12-31 02:53:00 V1    N53   N       53     2015-12-31 02:53:00 FALSE    
##  4 2015-12-31 03:53:00 V1    N53   N       53     2015-12-31 03:53:00 FALSE    
##  5 2015-12-31 04:53:00 V1    N53   N       53     2015-12-31 04:53:00 FALSE    
##  6 2015-12-31 05:53:00 V1    N53   N       53     2015-12-31 05:53:00 FALSE    
##  7 2015-12-31 06:53:00 V1    N53   N       53     2015-12-31 06:53:00 FALSE    
##  8 2015-12-31 07:53:00 V1    N53   N       53     2015-12-31 07:53:00 FALSE    
##  9 2015-12-31 08:53:00 V1    N53   N       53     2015-12-31 08:53:00 FALSE    
## 10 2015-12-31 09:53:00 V1    N53   N       53     2015-12-31 09:53:00 FALSE    
## # ... with 9,006 more rows

Then a function can be written to find issues where there are the wrong number, or wrong ordering, of begin and end times:

# Function to find wrong sequence of times
fnPrecip4 <- function(dfTimes, dfOrig) {
    
    # Pivot dfTimes back to a single record, discarding any of the consistency issues
    # OK if not all dtimes included; will get from dfOrig
    dfCheck <- dfTimes %>%
        filter(!issueCons) %>%
        mutate(newValue=case_when(chgType=="N" ~ 0, chgType=="B" ~ 1, chgType=="E" ~ -1)) %>%
        pivot_wider(dtime, names_from="name", values_from="newValue") %>%
        right_join(select(dfOrig, dtime, curPrecip, lagPrecip), by="dtime") %>%
        mutate_if(is.numeric, ~ifelse(is.na(.), 0, .))
    
    # Create the cumsum of state change
    cs <- dfCheck %>%
        select(-dtime, -curPrecip) %>%
        mutate(lagPrecip=as.integer(lagPrecip)) %>%
        select(lagPrecip, everything()) %>%
        apply(1, FUN=cumsum) %>%
        t()
    
    # Create the list of issues
    # If cumsum is every anything other than 1 or 0 there is a problem
    # If the last value of cumsum does not equal curPrecip there is a problem
    issue01 <- which(apply(cs, 1, FUN=max) > 1 | apply(cs, 1, FUN=min) < 0)
    issuelc <- which(dfCheck$curPrecip != cs[, ncol(cs)])
    issueAll <- sort(unique(c(issue01, issuelc)))
    
    cat("\n\nIssues with begin when precipitation or end when no precipitation:", length(issue01))
    cat("\nIssues where state change does not add to the total:", length(issuelc))
    cat("\nTotal issues (may be less than sum due to same datetime in both):", length(issueAll), "\n\n")
    # print(dfCheck[issueAll, ] %>% select(dtime, lagPrecip, everything()))
    
    dfCheck[sort(unique(c(issue01, issuelc))), "dtime"]
    
}

# Get the list of bad date-times
msp_precip4 <- fnPrecip4(msp_precip3, msp_precip1)
## 
## 
## Issues with begin when precipitation or end when no precipitation: 2
## Issues where state change does not add to the total: 8
## Total issues (may be less than sum due to same datetime in both): 8

Then, a function can work through only the bad date-times, leaving the good date-times “as is”

# Function to work through each bad date-time and suggest new ranges
fnPrecip5 <- function(df, issueTimes, lagCurFrame) {
    
    # Create a dataframe for issueTimes
    issues <- issueTimes %>%
        mutate(issue=TRUE)
    
    # Split df in to issues and non-issue
    df <- df %>%
        left_join(issues, by="dtime") %>%
        mutate(issue=ifelse(is.na(issue), FALSE, issue), 
               lastRecord=(row_number()==n())
               ) %>%
        left_join(select(lagCurFrame, dtime, lagPrecip, curPrecip), by="dtime")
    
    dfIssues <- df %>%
        filter(issue)
    dfNoIssues <- df %>%
        filter(!issue)
    
    # Note the records to be worked through
    cat("\nRecords to be addressed include:\n")
    dfIssues %>%
        select(-issueCons, -issue) %>%
        print()

    # Work through a record by starting with lagPrecip
    # If first record inconsistent with lagPrecip, flag for deletion and keep state; update state otherwise
    # if next record inconsistent with previous state, flag for deletion and keep state; update otherwise
    # If final record inconsistent with curPrecip, add a record using dtime as the time
    
    # Get the unique times with issues
    dtimeVec <- dfIssues %>% pull(dtime) %>% unique()
    
    # Create empty vectors for deletes and adds
    delVec <- as.POSIXct(character(0))
    addBegin <- as.POSIXct(character(0))
    addEnd <- as.POSIXct(character(0))
    
    # Populate the delete and add vectors
    for (dt in dtimeVec) {
        
        # Pull the records for this time
        dtRecords <- dfIssues %>% filter(dtime==lubridate::as_datetime(dt))
        
        # Initialize the previous state to lagPrecip and the error vector to blank
        preState <-dtRecords$lagPrecip[1]
        
        # Loop through and flag state change errors
        for (ctr in 1:nrow(dtRecords)) {
            
            # Grow the deletions vector
            if ((!preState & dtRecords$chgType[ctr]=="E") | (preState & dtRecords$chgType[ctr]=="B")) {
                delVec <- c(delVec, dtRecords$chgTime[ctr])
                # do not modify the state, it has not changed due to the deletion
            } else {
                # do not grow the deletion vector, the state change is OK here if chgType is not "N"
                if (dtRecords$chgType[ctr] != "N") { preState <- !preState }
            }
            
            # Create a single addition if needed
            if (ctr==nrow(dtRecords) & preState != dtRecords$curPrecip[ctr]) {
                if (dtRecords$curPrecip[ctr]) { addBegin <- c(addBegin, dt) }
                if (!dtRecords$curPrecip[ctr]) { addEnd <- c(addEnd, dt) }
            }
        }
        
    }
    
    # If there is precipitation at the very end, addEnd for dTime+1
    fixEnd <- dfNoIssues %>%
        filter(lastRecord & curPrecip) %>%
        mutate(timeUse=dtime+1) %>%
        pull(timeUse)
    if (length(fixEnd) > 0) {
        addEnd <- c(addEnd, fixEnd)
        cat("\nAdding final end time to cap interval at end of file:", as.character(fixEnd), "\n")
    }
    
    # Print the key vectors
    cat("\nStart/end times deleted:\n")
    print(lubridate::as_datetime(delVec))
    cat("\nBegin times added\n")
    print(lubridate::as_datetime(addBegin))
    cat("\nEnd times added:\n")
    print(lubridate::as_datetime(addEnd))
    
    # Create the full list of issue start times
    beginIssues <- dfIssues %>%
        filter(chgType=="B") %>%
        pull(chgTime)
    endIssues <- dfIssues %>%
        filter(chgType=="E") %>%
        pull(chgTime)
    
    beginIssues <- c(beginIssues[!beginIssues %in% delVec], addBegin)
    endIssues <- c(endIssues[!endIssues %in% delVec], addEnd)

    # cat("\nNew begin times list from issues:\n")
    # print(beginIssues)
    # cat("\nNew end times list from issues:\n")
    # print(endIssues)
    
    # Create the full list of start and end times
    beginOK <- dfNoIssues %>%
        filter(chgType=="B") %>%
        pull(chgTime)
    endOK <- dfNoIssues %>%
        filter(chgType=="E") %>%
        pull(chgTime)
    
    # Return a list of beginTimes and endTimes
    list(beginTimes=sort(c(beginOK, beginIssues)), 
         endTimes=sort(c(endOK, endIssues))
         )
    
}


msp_precip5 <- fnPrecip5(msp_precip3, issueTimes=msp_precip4, lagCurFrame=msp_precip1)
## 
## Records to be addressed include:
## # A tibble: 9 x 9
##   dtime               name  value chgType chgNum chgTime             lastRecord
##   <dttm>              <chr> <chr> <chr>   <chr>  <dttm>              <lgl>     
## 1 2016-04-05 15:53:00 V1    E22   E       22     2016-04-05 15:22:00 FALSE     
## 2 2016-05-29 19:53:00 V1    B31   B       31     2016-05-29 19:31:00 FALSE     
## 3 2016-06-20 04:53:00 V1    B11   B       11     2016-06-20 04:11:00 FALSE     
## 4 2016-09-06 07:53:00 V1    N53   N       53     2016-09-06 07:53:00 FALSE     
## 5 2016-09-06 08:53:00 V2    E01   E       01     2016-09-06 08:01:00 FALSE     
## 6 2016-09-06 08:53:00 V3    B17   B       17     2016-09-06 08:17:00 FALSE     
## 7 2016-10-18 04:53:00 V1    E35   E       35     2016-10-18 04:35:00 FALSE     
## 8 2016-11-22 13:53:00 V1    N53   N       53     2016-11-22 13:53:00 FALSE     
## 9 2016-11-29 07:53:00 V1    B18   B       18     2016-11-29 07:18:00 FALSE     
## # ... with 2 more variables: lagPrecip <lgl>, curPrecip <lgl>
## 
## Start/end times deleted:
## [1] "2016-10-18 04:35:00 UTC" "2016-11-29 07:18:00 UTC"
## 
## Begin times added
## [1] "2016-04-05 15:53:00 UTC" "2016-09-06 07:53:00 UTC"
## 
## End times added:
## [1] "2016-05-29 19:53:00 UTC" "2016-06-20 04:53:00 UTC"
## [3] "2016-09-06 08:53:00 UTC" "2016-11-22 13:53:00 UTC"

The intervals can then be checked for consistency:

fnPrecip6 <- function(lst, df){

    # Extract the beginning and interval times
    begins <- lst[["beginTimes"]]
    ends <- lst[["endTimes"]]
    durs <- ends - begins

    # Create intervals from the raw list file
    precipInts <- interval(begins, ends - 1) # make the interval stop the minute before the end time
    
    # Extract the METAR and date-time information and check for overlaps
    dtime <- df %>% pull(dtime)
    metar <- df %>% pull(origMETAR)
    precipMETAR <- df %>% pull(curPrecip)
    intMETAR <- sapply(dtime, FUN=function(x) {x %within% precipInts %>% any()})

    # Check for the consistency of the observations and print the mismatches
    print(table(precipMETAR, intMETAR))

    mism <- which(precipMETAR != intMETAR)
    if (length(mism) == 0) {
        cat("\nFull matches between METAR observations and intervals\n")
    } else {
        for (x in mism) {
            cat("\nMismatch at time", strftime(dtime[x], format="%Y-%m-%d %H:%M", tz="UTC"), "UTC\n")
            print(metar[max(1, x-2):min(length(metar), x+2)])
        }
    }
    
    list(beginTimes=lst$beginTimes, endTimes=lst$endTimes, 
         precipInts=precipInts, mismatches=mism, mismatchTimes=dtime[mism]
         )
    
}

msp_precip6 <- fnPrecip6(msp_precip5, msp_precip1)
##            intMETAR
## precipMETAR FALSE TRUE
##       FALSE  8390    0
##       TRUE      0  427
## 
## Full matches between METAR observations and intervals

The algorithm correctly fixes the Minneapolis issues, though it is still rather clunky. Next steps are to make a general algorithm and run against all locales.